Introduction to Big Data Frameworks for Beginners: Under the Hood of Hortonworks and Cloudera
As the Internet of Things (IoT) becomes a part of everyday life, with more data being collected than ever before, there is an increasing need to handle big data easily. However, demonstrating the benefits of big data storage and analysis has always been difficult, because it requires a large amount of domain-specific background knowledge: file system architecture, SQL coding paradigms, database and table structures, and analytical coding skills. It is not for the faint-hearted! Recent innovations in well-established frameworks, which now offer web-based front-ends that are familiar to most people, are proving to be very popular gateways into big data analysis and data management.
This post discusses several of the technologies common to two of the more popular frameworks: the Hortonworks Data Platform (HDP) and the Cloudera Distribution Including Apache Hadoop (CDH). A third framework, the MapR Converged Data Platform (MCDP), is also available, but it has more proprietary components and will not be discussed further. Both HDP and CDH are Hadoop distributions that are simple to use yet robust and secure, with lots of bells and whistles on top (some proprietary and some not). That’s quite a technological mouthful, so let’s break it down into component parts, focusing on the two most important pieces for audiences coming from a Windows Server environment and a SQL Server relational-database background.
Hadoop is an open-source framework, developed under the Apache Software Foundation, for large-scale data storage and data processing, optimized for clusters of off-the-shelf commodity computers. Under the hood, the ecosystem is complex, but the Hadoop community has gone to great lengths to build modules and applications that make it easy to configure and control the system’s parts, including:
- The Hadoop Distributed File System, or HDFS – the fault-tolerant file system that turns a cluster into a scalable storage pool
- YARN – the data operating system for process scheduling and resource management
- The computation/processing framework, which is MapReduce (on HDP) or Spark (on CDH), for parallel processing of large datasets. The main difference between the two is where intermediate data is kept: MapReduce writes it to disk between processing steps (and relies on HDFS replication and re-running failed tasks for fault tolerance), whereas Spark keeps it in memory where possible (and is fault tolerant through Resilient Distributed Datasets, or RDDs, which record how each dataset was derived so that lost pieces can be recomputed).
A simple everyday analogy to this would be to think about our transportation infrastructure; the system moves people (our data points) on bikes, cars, vans, trucks, and semi-trailers from one place to another. The flow of traffic is controlled by streets, stop signs, traffic signals, etc. You can think of HDFS as the freeway in this analogy (high volume in a short space of time), and YARN as the traffic signals that allow people to move in different directions; but these traffic lights are smart and understand what to do when accidents or backups occur. You know those occasions when you are driving downtown and all the traffic signals turn green in a row? That’s what YARN is like for data flow! Finally, you can think of MapReduce or Spark as a mass transit system: high volumes of people flowing in an efficient and planned manner.
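To make the "fault-tolerant storage pool" idea behind HDFS a little more concrete, here is a minimal Python sketch of how a file might be cut into blocks and copied onto several machines. It is an illustration only: the node names are made up, the 128 MB block size and replication factor of 3 are assumed values (they match common HDFS defaults), and the round-robin placement is a simplification, since real HDFS placement is rack-aware.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size (a common HDFS default)
REPLICATION = 3      # assumed replication factor (a common HDFS default)


def place_blocks(file_size_mb, nodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Split a file into fixed-size blocks and give each block
    `replication` copies on distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


def readable_after_failure(placement, failed_node):
    """The file stays readable if every block has a replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical four-node cluster storing a 500 MB file.
nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(500, nodes)
for block, replicas in placement.items():
    print(f"block {block}: {replicas}")
print("readable if node2 fails:", readable_after_failure(placement, "node2"))
```

Because every block lives on three different machines in this sketch, losing any single node still leaves at least two copies of each block, which is the essence of fault tolerance through replication.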
So, We Have the Empty Freeway, but What About Driving?
If the Hadoop framework is the foundation for a big data processing system, how do we actually load data, query it, and see what is happening? Multiple open-source Apache projects address these needs. Many of them are packaged in the HDP and CDH platforms, but there are too many to discuss in a single blog post (see the summary table below). We will concentrate on two:
- Data querying: both frameworks use Hive
- Administration/management dashboard: HDP uses Apache Ambari and CDH uses the Cloudera Manager
Table 1: Comparison of HDP and CDH platforms, as of 2016-06-21. When the same technology is used, the table cells are colored green, and when a different technology is used, the table cells are colored yellow.
The Buzz Around Hive
Hive is data warehousing software for reading, writing, and managing datasets that live in distributed storage (e.g. Hadoop’s HDFS). Its query language is SQL-like and will feel familiar to anyone used to Microsoft SQL Server, so Hive can be used for typical data warehousing tasks such as Extract-Transform-Load (ETL), ad-hoc reporting, and data analysis. It was originally developed at Facebook, then donated to the Apache Software Foundation and turned into an open-source project. Hive has a few distinctive features: it projects a relational structure onto data already in storage, handles multiple data formats, offers an extensible User Defined Function (UDF) system, and provides different mechanisms for query execution that are specifically designed for large-scale data processing. For those familiar with writing SQL queries, there is very little difference in the actual syntax, although the differences that do exist are important. Table 2 lists some of the standard SQL commands and their equivalents in the Hive Query Language (HQL, also written HiveQL).
Table 2: Abridged comparison of SQL and Hive queries
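The idea of projecting a relational structure onto data already in storage (often called schema-on-read) can be sketched in a few lines of Python. This is not Hive code: the raw text, the column names, and the `read_with_schema` helper are all hypothetical, and the sketch only mimics the concept of applying a table definition at query time instead of at load time.

```python
import csv
import io

# Raw text as it might already sit in storage: no header, no types.
# (Hypothetical sample data.)
RAW = "1,alice,34.50\n2,bob,12.00\n3,alice,7.25\n"

# Hypothetical table definition, applied only when the data is read.
SCHEMA = [("order_id", int), ("customer", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Yield each raw line as a typed dict; the stored text is untouched."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


rows = list(read_with_schema(RAW, SCHEMA))

# Roughly what "SELECT customer, amount FROM orders WHERE amount > 10"
# would return against this data.
big_orders = [(r["customer"], r["amount"]) for r in rows if r["amount"] > 10]
print(big_orders)
```

The stored file never changes here; a different schema could be applied to the same text tomorrow, which is what makes this approach convenient for data that has already landed in HDFS.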
Hive was created for long-running sequential scans on a big data (Hadoop) stack. This means queries can have high latency, on the order of several minutes. Hive compiles each query into one or more MapReduce jobs that run across the Hadoop cluster. As a result, Hive is not suitable for applications that require very fast response times or a high volume of small write operations (for example, transactional systems like e-commerce applications). It is, however, well suited to crunching through very large volumes of data in a very methodical way. But how do we execute these commands on our HDP or CDH framework?
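Before turning to that question, it may help to see the shape of the translation from a query to MapReduce. The Python sketch below shows conceptually how an aggregate such as `SELECT page, COUNT(*) FROM page_views GROUP BY page` breaks down into a map phase, a shuffle, and a reduce phase. The `page_views` table and its rows are hypothetical, everything runs in one process, and the real plan Hive generates is considerably more elaborate.

```python
from collections import defaultdict

# Hypothetical rows of a table page_views(page, user).
ROWS = [
    ("home", "u1"),
    ("docs", "u2"),
    ("home", "u3"),
    ("home", "u1"),
    ("docs", "u4"),
]


def map_phase(rows):
    """Map: emit (group key, 1) for every input row."""
    for page, _user in rows:
        yield page, 1


def shuffle(pairs):
    """Shuffle: gather all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle(map_phase(ROWS)))
print(counts)
```

On a real cluster the map and reduce steps run on many machines at once and the shuffle moves data across the network, which is a large part of why even a simple query carries noticeable start-up latency.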
Meet Ambari and/or the Cloudera Manager: Your Driver’s Dashboard
So far, we have discussed several of the component parts that together form a modern, distributed, big data computing system. But how do we actually see and manage what is happening from a user’s perspective? Specifically, if we are crunching a large amount of data, how do we know the status of a query or request? Did it succeed or fail? If it failed, why? Both frameworks have a simple web-based administrative interface for controlling the whole Hadoop ecosystem, with clickable buttons, drop-down menus, and text fields for entering queries. See the screenshots below.
Figure 1: The web-based Ambari administrative interface to HDP
Figure 2: The web-based Cloudera Manager administrative interface to CDH
The systems are comparable in functionality, and customizable for your individual big data needs. The key feature of the interface for any big data system is the ability to monitor the metrics associated with the data crunching you are undertaking. As you can see from the screenshots, both frameworks have simple displays illustrating status across the component parts of the Hadoop stack.
The Benefits of a Complete Framework
Until relatively recently, you would have needed a skilled team of system administrators, developers, and analysts to take advantage of a fully functional Hadoop ecosystem. However, this is no longer the case. Both frameworks contain everything that you need for big data analysis, from the HDFS file system, the YARN data operating system, and the Hive data access/processing layer to the analytical tools for interacting with the data and the over-arching administrative user interface. If you want to take these big data Cadillacs for a test drive, download the respective Virtual Machine (VM). (Important note: these are multi-gigabyte files, so avoid downloading them over your local coffee shop’s wireless!)
Once installed, you will have a single-node demo cluster on your computer (I am running it on my Lenovo laptop under Windows 10 with the free Oracle VM VirtualBox). Both frameworks also have an excellent repository of tutorials and online discussion rooms to get you started on your Big Data journey.
This post is the first in a Big Data series exploring the Hadoop ecosystem. The next post will introduce and dive into details of Spark, one of the most popular Big Data processing engines.