In December O'Reilly published Architecting Modern Data Platforms, a 636-page guide to implementing Hadoop projects in enterprise environments. The book was written by four authors, Lars George, Jan Kunigk, Paul Wilkinson and Ian Buss, all of whom have worked, or currently work, at Cloudera.
Cloudera has over 2,700 customers using its Hadoop platform offering and consulting services. Of these, 74 spend more than a million dollars a year with Cloudera. This puts Cloudera's staff in a unique position to discuss the key issues to consider when putting together the architecture of an enterprise data platform.
This book is aimed at IT Managers, Architects, Data Platform Engineers and System Administrators. If your role is supporting a single relational database that lives on a single server, this book will do a good job of exposing you to a world where you can increase performance and reliability by adding more computers to your infrastructure. The book is less aimed at Web and Mobile Developers, Project Managers, Product Owners and other roles you might find on data projects. If you're a Data Scientist less interested in the underlying platform and its security then this book might not be for you.
Do note: the chapters in the physical copy I ordered from Amazon's UK site don't match those listed on the O'Reilly product page. My copy has 19 chapters, not 21, and the chapter titles stop matching from chapter 14 onward.
Foreword & Preface
The book starts out with a foreword from Mike Olson, one of the founders of Cloudera, in which he discusses how many of the concepts behind Hadoop are decades old and only really came to life after Google's need for distributed compute and storage over massive web-scale datasets resulted in academic papers that inspired the authors of various Hadoop tools.
The book's preface discusses how Big Data solutions have often had to sacrifice some features found in conventional relational databases in order to meet various scaling criteria. It also dispels the misconception that many of these tools are schema-less by explaining the concept of "schema on read". The notions that there should only ever be one copy of a dataset, and only one cluster that runs forever, are also dismissed.
In describing horizontal scaling it's clarified that although commodity hardware is used for the infrastructure, this doesn't mean the cheapest computers available. It's emphasised that highly efficient networks and storage systems are still key to taking full advantage of the software on offer.
Mike also reminds the reader that the authors of the book are not only practitioners in implementing Hadoop systems, they're also active participants in the open source community.
Part I. Infrastructure
The first chapter goes into further detail on how the architecture of Hadoop was inspired by academic papers published by Google in the early 2000s. It then discusses how various Hadoop ecosystem tools share data between one another and how they can control and support one another as well. HDFS, YARN, ZooKeeper, Hive and Spark are central in these descriptions.
There are also lengthy discussions around Impala, HBase and Oozie. The above tools enjoy good support from Cloudera but I'm disappointed not to see more than a few lines describing Presto, Cassandra and Airflow.
Chapter two's topic is cluster infrastructure. It discusses the need for deploying multiple clusters in order to allow for independence and for different versions of software to be used. It also discusses the benefits of decoupling storage from compute. There might be a conception that any and all Hadoop tooling you intend to use should be installed on the same cluster, but there is a suggestion that HBase and Kafka could live happily on their own clusters, use hardware more tuned to their needs and benefit from isolated CPU and disk caches. Multi-tenancy is also discussed in this chapter.
Chapter three goes into detail around the Linux system calls being made by various Hadoop tools and what their performance characteristics are like on the underlying CPU, memory, various NUMA configurations, storage devices and file systems. There is mention of various Intel-specific hardware optimisations that Hadoop takes advantage of both for data redundancy and security.
HDFS' architecture and behaviour patterns are described very well and this is probably the best-written description of the technology I've come across. Erasure coding and replication are described in detail and are accompanied by detailed diagrams. RAID as a Hadoop anti-pattern is explained well, and they don't fail to mention that the underlying metadata stores could see resiliency benefits from certain RAID configurations.
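To give a feel for what this looks like on a cluster, here's a minimal sketch of applying an erasure coding policy to an HDFS directory on Hadoop 3; the path is made up and the policy shown is just one of the built-in Reed-Solomon options.

```bash
# List the erasure coding policies this cluster knows about
hdfs ec -listPolicies

# Apply a Reed-Solomon 6+3 policy to a hypothetical cold-data directory; new files
# written underneath it are erasure-coded instead of being replicated three times
hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k

# Confirm which policy is now in effect
hdfs ec -getPolicy -path /data/cold
```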
There are several pages going over various storage options and typical server inventory part lists. Page 97 has a chart with compute and I/O intensity levels on the axes and plots of 20 workloads showing where they sit relative to one another on these two metrics. For example, data cleansing is neither compute nor storage intensive, graph processing is very compute intensive while needing very little I/O, sorting is I/O intensive and shouldn't be bottlenecked by compute capacity and large SQL joins can often be both compute- and I/O-bound.
There is a discussion around what roles various servers could play in small, medium and large clusters. Pages 102 and 103 contain diagrams suggesting which servers within two racks should house which Hadoop services in a hypothetical 200-node cluster.
Chapter four focuses on networking. It starts out explaining how Hadoop's tools use remote procedure calls (RPC) for monitoring, consensus and data transfers. There is a table on page 108 describing the client-server and server-server interactions for ZooKeeper, HDFS, YARN, Hive, Impala, Kudu, HBase, Kafka and Oozie.
Latency, and how it does or doesn't affect various systems within Hadoop, is discussed on page 109. Data shuffles are discussed over three pages with two helpful diagrams. Consensus and quorum-based majority voting systems, along with a lengthy discussion of networking topologies, complete the chapter.
Chapter five discusses the various roles that can exist in a Hadoop project. It gives an example scenario of a typical business intelligence project. There are helpful diagrams outlining what sort of width and depth of the skill spectrum would be expected of various roles including Architects, Analysts, Developers and Administrators.
Chapter six covers data centre considerations. This book does discuss the Cloud at length but that comes later in the book. Cloudera has a sizeable number of clients running its offerings on bare metal so they're in a good position to offer opinions and guidance on what to look out for when you're renting or buying the hardware your Hadoop cluster runs on.
Part II. Platform
Chapter seven discusses operating system configuration considerations to make when setting up a cluster. SELinux, firewalls and containers, and their relationship with various Hadoop tools, are discussed. This is one of the few chapters with command line examples and they're geared towards those running Red Hat Enterprise Linux.
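To give an idea of the register of those examples, the sketch below shows the sort of pre-install tuning typically done on RHEL hosts; these are common conventions rather than the book's exact commands.

```bash
# Disable transparent huge pages, which are known to cause latency spikes under Hadoop
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Discourage the kernel from swapping out long-lived JVM heaps
sysctl -w vm.swappiness=1

# Check what SELinux and firewalld are doing before any cluster services are installed
getenforce
systemctl status firewalld
```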
Chapter eight discusses platform validation. It goes into detail around smoke, baseline and stress testing the hardware you're using for your cluster. Both disk caches and network latency are well explained in this chapter. There is a short mention of benchmarks like TPC-DS.
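As a rough illustration of the baseline-then-smoke-test progression (paths, sizes and the examples JAR location are placeholders that vary by distribution):

```bash
# Baseline the sequential write throughput of a single data disk
fio --name=seqwrite --rw=write --bs=1M --size=10G --numjobs=1 \
    --directory=/data/01 --direct=1

# Smoke test HDFS and YARN end to end: generate 1 TB of data and sort it
hadoop jar hadoop-mapreduce-examples.jar teragen 10000000000 /benchmarks/teragen
hadoop jar hadoop-mapreduce-examples.jar terasort /benchmarks/teragen /benchmarks/terasort
```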
Chapter nine focuses on security. It covers in-flight encryption, authentication and authorisation, as well as how these are addressed in Hadoop's KMS, HDFS, HBase, Kafka, Hive, Solr, Spark, Kudu, Oozie, ZooKeeper and Hue.
Kerberos is described very well, something that's rare to come across. The book explains how one Kerberos KDC can be configured to trust another without the other needing to reciprocate that trust, as well as how long-running applications can be set up to use keytabs, since Kerberos tickets are time-limited.
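In practice that looks something like the following, with a made-up service principal and keytab path:

```bash
# Obtain a ticket-granting ticket from a keytab rather than an interactive password,
# which is how long-running services re-authenticate once their tickets expire
kinit -kt /etc/security/keytabs/etl.service.keytab etl/gateway01.example.com@EXAMPLE.COM

# Inspect the credential cache to confirm the principal and the ticket's expiry time
klist
```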
Kerberos expertise is very expensive to bring in on a consulting basis and there isn't anything that handles authentication across as many Hadoop projects as Kerberos does. Enterprises don't just open up Hadoop clusters to their entire internal network, nor do they commonly run air-gapped environments, so this chapter alone pays huge dividends.
There is also a long discussion on encryption at rest and the trade-offs between full volume encryption, HDFS transparent data encryption and other service-specific configurations.
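For context, the HDFS transparent data encryption workflow boils down to something like this sketch; the key name and path are hypothetical.

```bash
# Create an encryption key in the Hadoop KMS
hadoop key create pii-key

# Turn an empty directory into an encryption zone backed by that key; files written
# under it are encrypted on the client before they ever reach the DataNodes
hdfs crypto -createZone -keyName pii-key -path /data/pii

# List the encryption zones the NameNode knows about
hdfs crypto -listZones
```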
Chapter ten discusses integration with identity management providers. There is a lot of discussion around Kerberos in this chapter, as well as around Hadoop-specific certificate management, with configuration examples and example OpenSSL commands.
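The flavour of those commands is roughly as follows; a hedged sketch with placeholder hostnames and passwords, not the book's own examples.

```bash
# Generate a private key and certificate signing request for a worker node
openssl req -newkey rsa:2048 -nodes \
    -keyout worker01.key -out worker01.csr \
    -subj "/CN=worker01.example.com/O=Example Corp"

# Once the internal CA has signed the CSR, bundle the key and certificate into a
# Java keystore that Hadoop services can point their TLS settings at
openssl pkcs12 -export -in worker01.pem -inkey worker01.key \
    -out worker01.p12 -name worker01 -password pass:changeit
keytool -importkeystore -srckeystore worker01.p12 -srcstoretype PKCS12 \
    -destkeystore worker01.jks -srcstorepass changeit -deststorepass changeit
```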
Chapter eleven discusses how to control access to cluster software. There are tables detailing which pieces of software support REST, Thrift, JDBC and ODBC interfaces and which provide a Web UI. Various access topologies are offered as inspiration for how one might want to set up their specific implementation. Proxies and load balancers are discussed at length as well.
Chapter twelve discusses high availability. This chapter is 47 pages long and goes into setting up redundancy in hardware and software using both active-active and active-passive setups for most Hadoop-related software and their dependencies.
Chapter thirteen discusses backups. There are a lot of different systems in a Hadoop cluster that store data and state, and each needs to be taken into consideration when planning how you'll back up a whole cluster.
If you were to restore a non-trivial amount of data you'd need to consider how quickly the data could be transferred from another location and what sort of bottlenecks hardware might impose. There is also discussion around taking snapshots versus replicating changes and whether or not you should consider replicating deletes as well as additions to your system.
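On HDFS itself the two approaches look roughly like this; cluster and path names are placeholders.

```bash
# Take a cheap, metadata-only, point-in-time snapshot that lives on the same cluster
hdfs dfsadmin -allowSnapshot /data/warehouse
hdfs dfs -createSnapshot /data/warehouse nightly-backup

# Replicate changes to a second cluster; -update only copies what differs and
# -delete propagates removals, which is exactly the trade-off mentioned above
hadoop distcp -update -delete \
    hdfs://prod-nn:8020/data/warehouse hdfs://dr-nn:8020/data/warehouse
```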
It would have been nice if Cloudera had got one of their petabyte-plus clients to sign off on a short case study on how they've set up, and had to use, one of their backups.
Part III. Taking Hadoop to the Cloud
Chapter fourteen discusses virtualisation. There is an important concept discussed called "Anti-Affinity Groups". The idea is that you don't want to place cluster nodes that are meant to provide high availability on the same physical machine. If all the ZooKeeper nodes in a quorum are on one physical server and that machine or its rack goes down then there will be a total loss of consensus. Likewise, if you have 20+ hard drives connected to your cluster via a single physical cable then scaling I/O horizontally could be bottlenecked and all of that storage will share a single point of failure.
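On an OpenStack-based private cloud, for instance, the concept can be expressed with server groups; the names below are made up.

```bash
# Create a server group whose members the scheduler must place on different hypervisors
openstack server group create --policy anti-affinity zookeeper-aa

# Launch each ZooKeeper VM into that group (the group UUID comes from the command above)
openstack server create --flavor m1.large --image centos7 \
    --hint group=<server-group-uuid> zookeeper-01
```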
Chapter fifteen discusses solutions for private cloud platforms. There is a lot in this chapter trying to persuade readers not to reinvent their own version of Amazon EMR. OpenStack and OpenShift are discussed as well as life cycles, automation and isolation. This is one of the shortest chapters in the book.
Chapter sixteen discusses solutions for public cloud platforms. This chapter goes over the managed Hadoop offerings from Amazon Web Services, Microsoft Azure and Google Cloud. They go over storage and compute primitives and then discuss ways of setting up high availability.
With these services it can be difficult to know if two nodes in your cluster live on the same physical machine, but they suggest using different instance sizes as a way of trying to avoid sharing a common underlying machine. There are a few other suggestions for trying to get different physical servers for any one cluster.
All three cloud providers offer blob storage as a way of decoupling compute and storage but they aren't completely comparable to one another. The major differences are discussed.
Default service limits are also highlighted. If you're planning on setting up a large Hadoop cluster or many small ones there are a lot of default account limits you'll have to request increases for from your provider.
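On AWS, for example, a couple of the defaults can at least be inspected from the CLI before filing increase requests; a small sketch.

```bash
# Show the account's default cap on concurrently running On-Demand instances
aws ec2 describe-account-attributes --attribute-names max-instances

# Elastic IP addresses per region have their own default ceiling worth checking
aws ec2 describe-account-attributes --attribute-names vpc-max-elastic-ips
```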
Page 474 shows a chart with compute capacity on one axis and memory capacity on the other. Across the chart they've plotted where various workloads would sit as well as where various compute instance types of AWS, GCP and Azure would live. For example, Complex SQL predicates are both compute- and memory-intensive and AWS' c5.9xlarge, GCP's n1-highcpu-64 and Azure's F64s_v2 might be good candidates for running this sort of workload. This reminds me of Qubole's "Presto performance across various AWS instance types" blog post.
This chapter does mention Cloudera CDH and Hortonworks HDP (which is used by Azure's HDInsight) on a few occasions as well.
Chapter seventeen covers automated provisioning. They make a good suggestion early on that Kerberos and BIND, as well as IAM identities and firewall rules, should probably be set up separately from any "Hadoop infrastructure as code" / PaaS provisioning. They also discuss considerations for safely scaling your clusters down when demand fades.
There is a section on "transient clusters", where a user submits a job as a workload and a cluster is provisioned for that job and destroys itself afterwards. There isn't a lot of detail on how the architecture of this specific setup would work but they do mention the three big cloud providers, Cloudera's own Altus offering and Qubole as providers that could help with this sort of workload.
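On AWS the pattern might look something along these lines; the bucket, script and sizing below are placeholders rather than a recommended configuration.

```bash
# Provision an EMR cluster that runs a single Spark step and then tears itself down
aws emr create-cluster \
    --name transient-etl \
    --release-label emr-5.20.0 \
    --applications Name=Spark \
    --instance-type m5.xlarge \
    --instance-count 5 \
    --use-default-roles \
    --auto-terminate \
    --steps 'Type=Spark,Name=nightly-etl,ActionOnFailure=TERMINATE_CLUSTER,Args=[s3://example-bucket/jobs/etl.py]'
```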
Though this chapter focused more on listing features you might want to think about in your implementation rather than giving concrete code examples, Puppet, Ansible and Chef are mentioned early on.
Chapter eighteen discusses security in the cloud and has an excellent introduction. It states that not knowing where your data is can be disconcerting. They discuss risk models, identity, connectivity and key management at length. They go so far as to state, or try to guess, the underlying hardware providers for the three major cloud vendors' key vault solutions.
There's discussion of service accounts for using Google Cloud Storage via HDFS' CLI and generating temporary security credentials for AWS S3 via the AWS CLI. There's also a reasonable amount of content discussing GDPR, US national security orders and law enforcement requests.
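The AWS side of that boils down to something like the following; the role ARN is hypothetical.

```bash
# Exchange long-lived IAM credentials for a short-lived session scoped to a role
aws sts assume-role \
    --role-arn arn:aws:iam::123456789012:role/s3-ingest \
    --role-session-name hadoop-ingest \
    --duration-seconds 3600

# The returned AccessKeyId, SecretAccessKey and SessionToken can then be handed to
# the S3A connector via fs.s3a.access.key, fs.s3a.secret.key and fs.s3a.session.token
```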
One great tip mentioned is that when you're using HDFS' encryption at rest and you delete someone's details from a dataset on a cluster, their data has possibly only been unlinked from the metadata describing where it lives on disk, not fully scrubbed away. If the disks go missing without that metadata then the original data will only exist as an encrypted blob on the physical disks and won't be recoverable without the original metadata and cryptographic keys.
The above could go some way to making an implementation compliant with GDPR's right to be forgotten with little effort being put into the infrastructure setup. Typically databases unlink rather than scrub, so "deletes" aren't really deletes and the original data might be recoverable.
I think there was a missed opportunity to discuss the threat of uncovering secrets in the history of git repositories used by an implementation team. It would be great to see a git bisect command example using Lyft's high-entropy-string to see if any developers accidentally committed credentials, removed them with a later commit and failed to rotate or destroy those credentials afterwards.
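As a rough stand-in for that idea, a plain grep over every commit in a repository's history (rather than Lyft's tool) would already flag AWS-style access key IDs, including ones whose changes were later reverted:

```bash
# Search every commit ever made for strings shaped like AWS access key IDs;
# git grep prefixes each hit with the commit hash it was found in
git rev-list --all | while read commit; do
    git grep -E 'AKIA[0-9A-Z]{16}' "$commit"
done
```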
Final Thoughts
I'll first mention a few complaints about this book.
First complaint: the sales pitch for Hadoop's strengths is missing from this book. If you're pitching an architecture to a client you need arguments for why your software choices are better than anything else out there. Hortonworks staff have publicly stated that there are 600 PB+ HDFS clusters in operation. It's likely that only Google has authored software that supports single clusters with a capacity greater than this. The in-memory nature of an HDFS NameNode means it's possible to support 60K concurrent HDFS requests. A few pages on why Hadoop is still a good idea in 2019 would have been greatly appreciated.
Second complaint: this book doesn't pick and mix much outside of the core software that Cloudera commercially supports. There are open source projects, not all affiliated with the Hadoop brand, that can greatly enhance a Hadoop setup. A lot of firms that aren't Hadoop-focused businesses build useful tools for the ecosystem. It's a shame not to see more mention of the projects led by Facebook, Uber, Airbnb and Netflix in this book.
EMR, Dataproc and HDInsight are covered, but other providers like Confluent, Databricks and Qubole are either not mentioned or mentioned only sparingly. Hortonworks, which merged with Cloudera in 2018, is mentioned where it offers a competing solution.
Hortonworks helped Facebook a lot with the ORC file format research they needed a few years back, and since Cloudera merged with Hortonworks last year I can't see why this case study couldn't be owned and proudly discussed in one of the very few Hadoop books of 2018.
If the scope of the book needs to be limited then I can understand that, but I've had enterprise clients comparing columnar file formats, asking how many files on HDFS would be too many and wanting some assurance that a reasonable number of providers had been considered to some extent.
A lot of companies that compete with one another also work together on the open source software in the Hadoop ecosystem. It would be nice to see the other firms volunteer an engineer or three to offer some expertise for comparisons. There aren't a lot of good Hadoop books released, so each one is an important occasion to make as complete as possible.
All the above said, Netflix's Iceberg project is mentioned in chapter 19, which was nice to see, but it would be nicer to see a larger catalogue of these complementary projects.
Third complaint: there should have been a chapter on orchestration. Apache Airflow has the momentum, and adds the sort of value, that Spark had a few years ago. I've never worked on a data project that didn't need to move data around in an observable fashion. Orchestration is mentioned, but not enough given how central it is to data platforms.
Fourth complaint: there's no "Future Features of Hadoop" chapter in this book. Ozone will help break the 500-million-file barrier in HDFS and potentially allow for trillions of files of all sizes to be stored using HDFS primitives. This system will have an AWS S3 API-compatible interface, which will make it easy to develop for and to add support for in existing applications. A third of the commits to the Hadoop git repository over the past few months have been Ozone-specific and I suspect this could be the biggest new feature of Hadoop this year.
With the complaints out of the way...
If you're only going to buy a single book and then supplement its teachings by going through JIRAs, examining source code, reading blogs and running examples in VMs, then this book is the most complete I've seen when it comes to covering the major sections of a data platform project using the Hadoop ecosystem of software. The topics discussed are all talking points I've had when consulting for large clients. Use this book as your roadmap.
The lack of command line and configuration examples does make this book more information-dense, while also meaning it won't date itself too quickly. The English language can be a lot more powerful than pages of deprecated commands and poorly-formatted XML.
Before publishing this post I checked Amazon's UK site and there are 3rd-party suppliers selling this book brand new for £37. I believe the knowledge I've picked up from reading this book should produce an amazing return on the investment.