By Neil Raden Hired Brains Research firstname.lastname@example.org
Hadoop “data warehouses” do not resemble the data warehouse/analytics that are common in organizations today. They exist in businesses like Google and Amazon for web log parsing, indexing, and other batch data processing, as well as for storing enormous amounts of unfiltered data. Petabyte-size data warehouses in Hadoop are not data warehouses as we know them; they are a collection of files on a distributed file system designed for parallel processing. To call these file systems “a data warehouse” is misleading because a data warehouse exists to serve a broad swath of uses and people, particularly in business intelligence, which is both interactive and iterative.
MapReduce is a programming paradigm with a single data flow type that takes the form of directed acyclic graph of operators. These platforms lack built-in support for iterative programs, quite different from the operations of a relational database. To put it in layman’s terms, there are things that Hadoop is exceptionally well designed for that relational databases would struggle to do. Conversely, a relational database data warehouse performs a multitude of useful functions that Hadoop does not yet possess. Hadoop is described as a solution to a myriad of applications in web log analysis, visitor behavior, image processing, search indexes, analyzing and indexing textual content, for research in natural language processing and machine learning, scientific applications in physics, biology and genomics and all forms of data mining. While it is demonstrable that Hadoop has been applied to all of the domains and more, it is important to distinguish between supporting these applications and actually performing them. Hadoop comes out of the box with no facilities at all to do most of this analysis. Instead, it requires the application of libraries available either through the open source community at forge.com or from the commercial distributions of Hadoop, or by custom development by scarce programmers. In no case can these be considered a seamless bundle of software that is easy to deploy in the enterprise. A more accurate description is that Hadoop facilitates these applications by grinding through data sources that were previously too expensive to mine. In many cases, the end result of a MapReduce job is the creation of a new data set that is either loaded into a data warehouse or used directly by programs such as SAS or Tableau.
The MapReduce architecture provides automatic parallelization and distribution, fault recovery, I/O scheduling, monitoring, and status updates. It is both a programming model and a framework for massively parallel processing of large datasets in batch across many low-end nodes. Its ability to spread very large jobs across a cluster of ordinary servers is perhaps its best feature, certainly its most unique feature. In addition, it has excellent retry/failure semantics. MapReduce at the programming level is simple and easy to use. Programmers code only Map() and Reduce() functions and are not involved with how the job is distributed. There is no data model, and there is no schema. The subject of a MapReduce job can be any irregular data. Because the assumption is that MapReduce clusters are composed of commodity hardware, and there are so many of them, it is normal for faults to occur during a job, and Hadoop handles a few faults automatically, shifting the work to other resources. But there are some drawbacks. Because MapReduce is a single fixed data flow, has a lack of schema, index and high-level language, one could consider it a hammer, not a precision machine tool. It requires data parsing and fullscan in its operation; it sacrifices disk I/O to avoid schemas, indexes, and optimizers; intermediate results are materialized on local disks. Runtime scheduling is based on speculative execution, considerably less sophisticated than today’s relational analytical platforms. Even though Hadoop is evolving, and the community is adding capabilities rapidly, it lacks most of the security, resource management, concurrency, reliability and interactive capabilities of a data warehouse. Hadoop’s most basic components – the Hadoop Distributed File System (HDFS) and MapReduce framework – are purpose built for understanding and processing multi-structured data. The file system is crude in comparison to a mature relational database system which when compared to the universal use of SQL is a limiting factor. However, its capabilities, which have just begun to be appreciated, override these limitations and tremendous energy is apparent in the community that continues to enhance and expand Hadoop.
Hadoop MapReduce with the HDFS is not an integrated data management system. In fact, though it processes data across multiple nodes in parallel, it is not a complete massively parallel processing (MPP) system. It lacks almost every characteristic of an MPP system, with the exception of scalability and reliability. Hadoop stores multiple copies of the data it is processing, and the failure of a node can rollover to another node with the same data, though there is also a single point of failure at the HDFS Name Node, which the Hadoop community is looking to address in the long term (Today, NetApp provides a hardware-centric fail-over solution for the Name Node). It lacks security, load balancing and an optimizer. Data warehouse operators today will find Hadoop to be primitive and brittle to set up and operate, and users will find its performance lacking. In fact, its interactive features are limited to a pseudo-relational database, Hive, whose performance would be unacceptable to those accustomed to today’s data warehouse standards. In fairness, MapReduce was never conceived as an interactive knowledge worker tool, and the Hadoop community is making progress, but HDFS, which is the core data management feature of Hadoop, is simply not architected to provide the services that relational databases do today. And those relational database platforms for analytics are innovating just as rapidly with:
• Hybrid row and columnar orientation.
• Temporal and spatial data types.
• Dynamic workload management.
• Large memory and solid-state drives.
• Hot/warm/cold storage.
• Almost limitless scalability.
The ability to provide almost endless scalabilty and parallelism for batch jobs is a unique distinction for Hadoop. The only platforms that were previously able to provide this sort of massive parallelism were relational databases, and they are not limited to batch operation. So what happens next? My guess is that Hadoop survives and flourishes as the first responder to incoming data, making sense of it and handing it off to other proceses, including data warehouses, in whatever form they take. However, unless petabytes of historical data are needed for interactive analysis, Hadoop will be the favored location for storing history. The Hadoop community, and its imitators and competitors will play an important role in analytics, but not the only role.