After both Cloudera and MapR announced a few weeks ago that their businesses are going through difficult periods, I've seen a stream of "Hadoop is dead" social media posts. These posts are nothing new, but in a sector where technical practitioners rarely produce quality social media material, these cries are getting louder and louder. I wanted to take a moment to address some of these arguments surrounding the state of Hadoop.
Competing with Free
Cloudera has offerings that help make Hadoop more of a turnkey solution. These tools originated in a time before devops went mainstream and automated deployments were rare to come across.
Their tools provide value to their 2,600+ customers but a lot of software in their offering is open source and available free of charge. Cloudera is ultimately competing with free software. To top that off, a lot of the Hadoop ecosystem developers have worked at Cloudera at one time or another so they end up subsidising the free offerings they compete against.
Because they compete with free, Cloudera will never serve 100% of the Hadoop user base. I'd hesitate to use them as an indicator of the health of Hadoop for this very reason.
Other firms that offer turnkey solutions around Spark and Presto have gone out of their way to distance themselves from the Hadoop brand. Their offerings may include hundreds of .jar files from various Hadoop projects but nonetheless, these firms want to do everything they can to avoid competing with free while lowering their development costs by utilising open source software. Sales is hard when your customer can legally download 80% of your offering without paying you for it.
Competing with AWS
In 2012 I worked on a Hadoop implementation with 25 other contractors. Some of my colleagues came from Google, others went on to work for Cloudera. There was a significant budget involved, very little of the Hadoop ecosystem was turnkey and a lot of billable hours were produced by the team.
Within a few years, AWS EMR had emerged and began eating market share. EMR allows you to launch Hadoop clusters with a large variety of software installed with a couple of clicks. It can run on spot instances, which cuts hardware costs by ~80%, and can store data on S3, which was, and still is, cheap and has 99.999999999% durability.
Suddenly, the need for 25 contractors on a project was gone. On some projects there might only be myself full-time and a few others part-time preparing infrastructure in addition to our other duties. There still is a need for consultants on projects using AWS EMR but the total billing potential for this sort of work is a lot smaller than a few years ago.
How much of Cloudera's potential business was lost to EMR? Cloudera did a good job at orchestrating setup and managing clusters on bare metal but a good chunk of the data world is on the Cloud these days. It's worth considering how attractive Hadoop is to businesses purely because there is a managed offering with spot instances available on AWS.
What is Hadoop?
If you asked me for the definition of Hadoop I'd say it's a large collection of open source software that, to a certain extent, integrates with one another and shares a number of common libraries. I see Hadoop as a decoupled database, almost like an operating system distribution for data.
Not all software projects under the Hadoop umbrella are Apache projects; Presto is one such exception. Others, like ClickHouse, with forthcoming HDFS and Parquet support, aren't seen by many as a Hadoop project even though they will soon tick the compatibility check box.
Neither Parquet nor ORC files existed prior to 2012. These file formats were instrumental in making analytics fast on Hadoop. Prior to these formats, workloads were largely row-oriented. If you needed to transform TBs of data and could do so in a parallel fashion then Hadoop did a good job of that. MapReduce was a framework often used for this purpose.
What columnar storage offered was a way to analyse TBs of data in seconds. This proved to be a more valuable proposition to more businesses. Data Scientists may only need a small amount of data to produce insights but they'll need to look over a data lake with potentially PBs of data to pick out what they need first. Columnar analytics is key for them to build the data fluency needed to know what to cherry-pick.
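To make the row-versus-column distinction concrete, here is a minimal in-memory sketch. It is not how Parquet or ORC encode data on disk; the table, field names and helper functions are all hypothetical and only illustrate why an aggregate over one field touches far less data in a columnar layout.

```python
# Hypothetical toy table: three rows with three fields each.
rows = [
    {"trip_id": 1, "city": "Tallinn", "fare": 12.5},
    {"trip_id": 2, "city": "London", "fare": 30.0},
    {"trip_id": 3, "city": "Tallinn", "fare": 8.25},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_fare_row_oriented(records):
    """Row layout: every whole record is visited to read one field."""
    return sum(record["fare"] for record in records)


def total_fare_columnar(columns):
    """Columnar layout: only the 'fare' column is read."""
    return sum(columns["fare"])


columns = to_columns(rows)
row_total = total_fare_row_oriented(rows)
col_total = total_fare_columnar(columns)
```

Both functions return the same total, but the columnar version never looks at trip_id or city. On disk, with TBs of data and dozens of fields, skipping the unused columns is where much of the speed-up comes from.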
MapReduce has two verbs of functionality, map and reduce, and it sees data as rows. Spark came about afterward and would use more verbs, such as filter and union, and would see data structured as a Directed Acyclic Graph (DAG). These primitives allowed Spark to run more sophisticated workloads like machine learning and graph analytics. Spark can still use YARN as its capacity scheduler, much like how jobs on MapReduce are executed, but the Spark team also began bundling in their own scheduler and later added support for Kubernetes.
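The two MapReduce verbs can be sketched in plain Python with a word count. This is a single-process illustration rather than Hadoop's actual API: the input lines and the function names (map_phase, shuffle, reduce_phase) are hypothetical, and the shuffle step stands in for the grouping the framework performs between the two phases.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input: each string is one row handed to a mapper.
lines = ["hadoop is not dead", "spark is not hadoop"]


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a row."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(values):
    """Reduce: fold all values for one key into a single count."""
    return reduce(lambda a, b: a + b, values)


mapped = [pair for line in lines for pair in map_phase(line)]
counts = {word: reduce_phase(values) for word, values in shuffle(mapped).items()}
```

Every job has to be squeezed into this map-then-reduce shape. Spark's additional verbs let a job chain many such steps together, which is where the DAG comes in.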
At some point the Spark community tried to distance itself from the Hadoop ecosystem. They didn't want to be seen as built on legacy software nor as some sort of "add-on" for Hadoop. Given the level of integration Spark has with the rest of the Hadoop ecosystem and given the 100s of libraries from other Hadoop projects being used by Spark I don't subscribe to the belief that Spark is its own thing.
MapReduce might not be the first choice for most workloads these days but it still is the underlying framework when using hadoop distcp, a software package that can transfer data between AWS S3 and HDFS faster than any other offering I've benchmarked.
Is Every Hadoop Tool a Success?
No, there are some projects that have been eclipsed by newcomers.
As an example, a lot of workloads that would have been automated with Oozie in the past are now being automated by Airflow. Robert Kanter, Oozie's primary developer, provided the bulk of the code base that exists today. Unfortunately, Robert hasn't made many commits to the project since he left Cloudera in 2018. Meanwhile, Airflow has over 800 contributors, a number that has almost doubled in the last year. Almost every single customer I've worked with since 2015 has used Airflow in at least one department within their organisations.
Hadoop provides various primitives and building blocks that make up a data platform. It's not uncommon for a few projects to compete to provide the same functionality. Eventually a few of these projects lose momentum while others assume the top role.
There were a few projects that were considered "go-to" for various workloads in 2010 that only ever had a few contributors or in some cases a few sizeable deployments. Seeing these projects come and go has been used as evidence the whole Hadoop ecosystem is dying but I do not see it this way.
I see this loose association of projects as a way to develop a lot of powerful functionality that can be used without any significant license fees by end users. It is the survival of the fittest and it proves that more than one approach has been considered to any given problem.
UPDATE: I had originally stated Oozie had 17 contributors based on what is reported by GitHub. Oozie has in fact had both direct commits and patches submitted by 152 developers to date, not just the 17 that appear in GitHub's calculation. Robert Kanter reached out to me following this post's initial publication with evidence of these additional 135 contributors and I thank him for that clarification.
Search Traffic Is Down
One of the arguments given for Hadoop's "demise" is that Google search traffic for various Hadoop technologies is down. Cloudera and a number of other consultancies did a good job of raising funding in years past and put considerable effort into marketing their offerings. This in turn sparked a lot of interest and at one point there was a wave of people in the tech community looking into these technologies. This community is diverse and at some point most people did, as they always will, move on to other things.
At no point in Hadoop's history has there been such a rich variety of features being offered as today and never before has it been so stable and battle-tested.
Hadoop projects are made up of millions of lines of code which have been written by thousands of contributors. In any given week there are 100s of developers working on the various projects. Most commercial database offerings are lucky to have a handful of engineers making any significant improvements to their code bases every week.
Why is Hadoop Special?
First, there are HDFS clusters with 600+ PB of capacity. The in-memory nature of HDFS' metadata means you can happily handle 60K operations per second.
AWS S3 broke a lot of what's found in POSIX file systems in order to achieve scalability. Rapid file modifications, like the kind needed when converting CSV to Parquet files, aren't possible with S3 and require something like HDFS if you want to distribute the workload. If conversion software was modified to make the above an S3-only workload the data locality trade-offs would likely be significant.
Second, the Hadoop Ozone project aims to provide an S3 API-compatible system that can store trillions of objects on a cluster without the need of a proprietary cloud service. The project aims to have native support for Spark and Hive giving it good integration with the rest of the Hadoop Ecosystem. When released, this software will be one of the first such open source offerings that can store this many files on a single cluster.
Third, even if you're not working with PBs of data, the APIs available to you in the Hadoop ecosystem provide a consistent interface for handling GBs of data. Spark is the definitive solution for distributed machine learning. Once you know the APIs, it doesn't matter if your workload is GBs or PBs; the code you produce doesn't need to be re-written, you just need more machines to run it on.
I'd much sooner teach someone how to write SQL and PySpark code than teach them how to distribute AWK commands across multiple machines.
Fourth, a lot of the features of the Hadoop ecosystem are a leading light for commercial vendors. Every failed sales pitch for a proprietary database results in the sales team learning of just how many missing features, compromises and pain points their offering has. Every failed POC results in the sales team learning just how robust their internal testing of their software really is.
No One Needs Big Data
The above projects often aren't advertised in a way that web developers would be exposed to them. This is why someone could spend years working on new projects that are at the bottom of their S-curve in terms of both growth and data accumulated and largely never see a need for data processing outside of what could fit in RAM on a single machine.
Web Development was a big driver in the population growth of coders over the past 25 years. Most people that call themselves a coder are most often building web applications. I think a lot of the skillsets they possess overlap well with those needed in data engineering but often distributed computing, statistics and storytelling are lacking.
Websites often don't produce much load with any one user and often the aim is to keep load on servers supporting a large number of users below the maximum hardware thresholds. The data world is made up of workloads where a single query is trying its best to maximize a large number of machines in order to finish as quickly as possible while keeping the infrastructure costs down.
Companies producing PBs of data often have a queue of experienced consultants and solutions providers at their door. I've rarely seen anyone plucked out of web development by their employer and brought into the data platform engineering space; it's almost always a lengthy, self-retraining exercise.
That Dataset Can Live in RAM
I hear people arguing "a dataset can fit in memory". RAM capacity, even on the Cloud, has grown a lot recently. There are EC2 instances with 2 TB of RAM. RAM can typically be used at 12-25 GB/s depending on the architecture of your setup. Using RAM alone won't provide any failure-recovery if the machine suffers a power failure. To add to this, the cost per GB is tremendous compared to using disks.
Disks are catching up in speeds as well. There was a 4 x 2 TB PCIe 4.0 NVMe SSD card announced recently that could read and write at 15 GB/s. The price point of the PCIe 4.0 NVMe drives will be very competitive with RAM and provide non-volatile storage. I can't wait to see an HDFS cluster with some good networking using those drives as it'll demonstrate what an in-memory data store with non-volatile storage with the rich, existing tooling of the Hadoop ecosystem looks like.
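As a rough back-of-the-envelope check on those throughput figures, the sketch below estimates how long one sequential pass over a 2 TB dataset would take. It assumes decimal units (1 TB = 1,000 GB) and that the quoted rates are sustained for the whole scan, which real hardware rarely manages; the function name is hypothetical.

```python
def scan_seconds(dataset_gb, throughput_gb_per_s):
    """Time for one sequential pass, assuming sustained throughput."""
    return dataset_gb / throughput_gb_per_s


DATASET_GB = 2_000  # 2 TB in decimal units, matching the EC2 RAM figure above

ram_slow = scan_seconds(DATASET_GB, 12)  # lower end of the RAM range
ram_fast = scan_seconds(DATASET_GB, 25)  # upper end of the RAM range
nvme = scan_seconds(DATASET_GB, 15)      # the announced PCIe 4.0 NVMe card

for label, seconds in [("RAM @ 12 GB/s", ram_slow),
                       ("RAM @ 25 GB/s", ram_fast),
                       ("NVMe @ 15 GB/s", nvme)]:
    print(f"{label}: {seconds:.1f} s")
```

Under these assumptions the NVMe card lands inside the range quoted for RAM, which is the point: the gap between volatile and non-volatile storage is narrowing.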
I wouldn't want to spend 6 or 7 figures on designing a data platform and a team for a business that couldn't scale beyond what fits on any one developer's laptop.
In terms of workflow, my days mainly consist of using BASH, Python and SQL. Plenty of new graduates are skilled in the above.
A PB of Parquet data can be nicely spread across one million files on S3. The planning involved with the above isn't much more than considering how to store 100,000 micro-batched files on S3. Just because a solution scales doesn't mean it's overkill.
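The arithmetic behind that layout is simple enough to sketch. The snippet assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and evenly sized files; the helper name is hypothetical.

```python
PB = 10**15  # bytes, decimal units (assumption)
GB = 10**9   # bytes, decimal units (assumption)


def average_file_size_gb(total_bytes, file_count):
    """Average size per file in GB, assuming the data is spread evenly."""
    return total_bytes / file_count / GB


avg_gb = average_file_size_gb(PB, 1_000_000)
print(f"Average file size: {avg_gb} GB")
```

Under those assumptions each file comes out at about 1 GB.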
Just use PostgreSQL?
I've also heard arguments that row-oriented systems like MySQL and PostgreSQL can fit the needs of analytical workloads as well as their traditional transactional workloads. Both of these offerings can do analytics and if you're looking at less than 20 GB of data it's probably not worth the effort of having multiple pieces of software running your data platform.
That being said, I've had to work with a system that was feeding 10s of billions of rows into MySQL on a daily basis. There is nothing turnkey about MySQL or PostgreSQL that lends itself to handling this sort of workload. The infrastructure costs to keep the datasets, even for just a few days, in row-oriented storage eclipsed the staffing costs. The migration to a columnar storage solution for this client brought down those infrastructure costs by two orders of magnitude and sped up querying times by two orders of magnitude.
PostgreSQL has a number of add-ons for columnar storage and multi-machine query distribution. The best examples I've seen are commercial offerings. The announced Zedstore could go some way to bringing columnar storage as a standard, built-in feature of PostgreSQL. It'll be interesting to see if single query distribution and storage decoupling become standard features as well in the future.
If you have a transactional need for your dataset it's best to keep this workload isolated with a transactional data store. This is why I expect MySQL, PostgreSQL, Oracle and MSSQL to be around for a very long time to come.
But would you like to see a 4-hour outage at Uber because one of their Presto queries produced unexpected behaviour? Would you like to be told your company needs to produce invoices for the month so the website will need to be switched off for a week so there are enough resources available for the task? Analytical workloads don't need to be coupled with transactional workloads. You can lower operational risks and pick better suited hardware by running them on separate infrastructure.
And since you're on separate hardware you don't need to use the exact same software. Many skills that make a competent PostgreSQL Engineer lend themselves well to the analytics-focused data world; it's less of a leap than that of a web developer moving into the Big Data space.
What does the future look like?
I expect to continue analysing and widening my skillset in the data space for the foreseeable future. In the past 12 months I've delivered work using Redshift, BigQuery and Presto, almost in even amounts. I try to spread my bets as I've yet to find a working crystal ball.
One thing I do expect is more fragmentation and more players to both enter and crash out of this industry. There is a reason for most databases to exist but the use cases they can serve can be limited. That being said, good sales people can go some way to extending the market demand for any given offering. I've heard people estimate it would take $10M to produce a commercial-quality database which means this is probably the sweet spot for venture capital.
There are plenty of offerings and implementations out there that leave customers with a bad taste in their mouth. There is such a thing as Cloud sticker shock. There are solutions which are great but very expensive to hire expertise for. Arguing the trade-offs above will keep the sales and marketing people in the industry busy for some time to come.
Cloudera and MapR might be going through a hard time right now but I've heard nothing to make me believe it's anything other than sunshine and roses at AWS EMR, Databricks and Qubole. Even Oracle is releasing a Spark-driven offering. It would be good for the industry to see Hadoop as more than just a Cloudera offering and acknowledge that the above firms, as well as Facebook, Uber and Twitter, have all made significant contributions to the Hadoop world.
Hortonworks, which merged with Cloudera this year, are the platform providers for Azure HDInsight, Microsoft's managed Hadoop offering. The company has the people that can deliver a decent platform to a 3rd-party Cloud provider. I hope whatever offerings they're working on for the future are centred on this sort of delivery.
I suspect Cloudera's early customers were users of HBase, Oozie, Sqoop and Impala. It would be good to see these not compete for so much engineering time and for future versions of their platform to come with Airflow, Presto and the latest version of Spark out of the box.
At the end of the day, if your firm is planning on deploying a data platform there is no replacement for the astute management team that can research diligently, plan carefully and fail quickly.