As I sometimes do, I went to Boulder last week to soak up some of Claudia and Dave Imhoff’s hospitality and to sit in on a BBBT (Boulder BI Brain Trust) briefing in person instead of remotely like most of us do. The company this particular week was Cloudera and I wanted to not only listen to their presentations, but participate in the Q&A give and take as well as have more intimate conversations at dinner the night before. Despite the fact it took eight hours to drive there from Santa Fe (but only six back), it was clearly worth the effort. I certainly enjoyed meeting all the Cloudera people who came, but since this article is about Hadoop, not Cloudera, I’ll skip the introductions.
A common refrain from any Hadoop vendor (the term vendor is a little misleading because the open source Hadopp is actually free), is that Hadoop, almost without qualification, is a superior architecture for analytics over its predecessor, the relational database management systems (RDBMS) and its attendant tools, especially ETL (Extract/TransformLoad, more on that below). Their reasoning for this is that it is undeniably cheaper to load Hadoop clusters with gobs of data than it is to expand the size of a licensed enterprise relational database. This is across the board – server costs, RAM and disk storage. The economics are there, but only compelling when you overlook a few variables. Hadoop stores three copies of everything, data can’t be overwritten, only appended to, and most of the data coming into Hadoop is extremely pared down before it is actually used in analysis. A good analogy would be that I could spend 30 nights in a flophouse in the Tenderloin for what it would cost for one night at the Four Seasons.
But I did say I was getting convinced about Hadoop, so be patient.
A constant refrain from the Hadoop world is that is difficult and time-consuming to change a schema in a RDBMS, but Hadoop, with its “schema on read” concept allows for instantaneous change as needed. Maybe not intentionally, but this is very misleading. What is hard to change in a RDBMS is making changes to an application such as a DW with upstream and downstream dependencies. I can make a change to a database in two seconds. I can add non-key attributes to a Data Warehouse dimension table in an instant. But changing a shared, vetted, secure application is, reasonably, not an instantaneous thing, which illustrates something about the nature of Hadoop applications – they are not typically shared applications. Often, they are not even applications, so this comparison makes no sense. Instead, it illustrates two very important qualities of RDBMS’ and Hadoop.
One more item about this “hard to change” charge. Hadoop is composed of the file system, HDFS and the programming framework MapReduce. When Hadoop vendors talk about the flexibility and scalability of Hadoop, they are talking about this core. But today, the Hadoop ecosystem (and this is just the Apache open source stuff, there is an expanding soup of add on’s appearing everyday) has more than 20 other modules in the Hadoop stack that make it useful. While I can do whatever I want with the core, once I build applications with these other modules there are just as many dependencies up and down the stack that need to be attended to when changing things as in a standard Data Warehouse environment.
But wait. Now we have the Stinger Initiative for Hive, Hadoop’s SQL-ish database, to make Hive 100x faster. This is accomplished by jettisoning MapReduce and replacing it with Tez, the next-generation MapReduce. According to Hortonworks, Tez is “better suited to SQL” The Stinger initiative also includes ORCfile file for better compression, vectorizing Tez so that, unlike MapReduce, it can grab lots of records at once. And on top of it all, the crown jewel in any relational database, a Cost-Based Optimizer (CBO) which can only work with a, wait for this, schema! In fact, in the demo I saw today from Hortonworks, they were actually showing iterative SQL queries against, again, wait for it…a STAR SCHEMA! So what happened to schema on read? What happened to how awful RDBMS was compared to Hadoop? See where this is going? In order to sell Hadoop to the enterprise, they are making it work like a RDBMS.
There are four kinds of RDBMS’s in the market today (and this is my market definition, no one else’s): 1) Enterprise Data Warehouse database systems designed from the ground up for data warehousing. As far as I’m concerned, there is only one that can handle massive volumes, huge mixed workloads broad functionality, tens of thousands of users a day – Teradata 6xxx series; 2) RDBMS designed for transactional processing, but positioned for data warehousing too, just not as good at it such as Oracle, DB2 and MSSQL; 3) Analytical databases, either sold as software-only or as appliances – IBM Netezza, H-P Vertica, Teradata 2xxx series; 4) In-memory databases such as SAP HANA, Oracle Times Ten and passel of others. Now we have a fifth – SQL-compliant (not completely) databases running on top of HDFS. There are more versions of these, too, such a SpliceMachine, now in public beta, as well as Drill, Impala, Presto, Stinger, Hadapt and Spark/Shark to name a few (although Daniel Abadi of Hadapt has argued that “Structured” query language misses the point of Hadoop entirely – flexibility). Now Hadoop is sort of five.
So where are we going with this? Like Clinton in the 90’s it’s clear Hadoop is moving to the center. Purist Hadoop will continue to exist, but market forces are driving it to a more palatable enterprise offering. Governance, security, managed workloads, interactive analysis. All of the things we have now except cheap platforms for greater volumes of data and massive concurrency.
I do wonder about one thing, though. The whole notion of just throwing more cheap resources at it has to have a point of diminishing returns. When will we get to the point that Hadoop is working 100X or 1000x more resources than would be needed in a careful architecture? Think about this. If we morph Hadoop into just a newer analytical database platform, sooner or later someone is going to wonder why we have 3 petabytes of drives and only 800 terabytes of data. In fact, how much duplication is in that data? How much wasted space? Drives may be cheap, but even a thousand cheap drives cost something, especially when they’re only 20% utilized.
Hadoop was invented for indexing search and other internet-related activities, not enterprise software. It’s promotion to all forms of analytics is curious. Where did anyone prove that its architecture was right for everything, or did the hype just get sold on being cheap? And what is the TCO over time vs a DW?
And when Hadoop venders say, “Most of our customers are building an Enterprise Data Hubs or (a terrible term) Data Lakes next to their EDW because they are complementary, it begs the question, for analytics in typical organizations, what exactly is complementary? That’s when we hear about sensors and machine-generated data and the social networks. How universal are those needs?
Then there is ETL. Why do it in expensive cycles on your RDBMS data warehouse when you can do it in Hadoop? They need to be reminded that writing code is not quite the same as an ETL tool with versioning, collaboration, reuse, metadata and lots of existing transforms built in. It’s also a little contradictory. If Hadoop is for completely flexibile and novel analysis, who is going to write ETL code for every project? Now there is a real latency: only five minutes to crunch the data and 30 30 days to write the ETL code.
They talk about using Hadoop as an archive to get old data out of a data warehouse, but they fail to mention that that data is unusable with the context that still remains in the DW; nor will it be usable in the DW later after the schema evolves. So what they really mean is use Hadoop as a dump for data you’ll never use but can’t stand to delete, because if you don’t need it in the DW, why do you need it at all?
Despite all this, it’s a tsunami. The horse had left the stable. The train has left the station. Hadoop will grow and expand and probably not even be recognizable as the original Hadoop in a few years and it will replace the RDBMS as the platform of choice for enterprise applications (even if the bulk of application of it will be SQL-based). I guarantee it. So get on top of it or get out of the way.