How Is Hadoop Evolving to Meet Big Data Needs?
The author of "Hadoop: The Definitive Guide" offers an update on the current state of Hadoop and where it is headed.
Hadoop just keeps on growing. According to Allied Market Research, the Hadoop market is worth around $3 billion currently and will surpass $50 billion by the end of the decade. So what is propelling Hadoop to such a meteoric rise?
It made a name for itself initially as an open source technology that facilitated the storage and analysis of large volumes of data. Managed by the Apache Software Foundation and inspired by Google's MapReduce and Google File System papers, it gained ground at companies like Facebook, Yahoo and Amazon as a repository for unstructured data.
Since its early days, though, Hadoop has evolved well beyond that. As a sign of its growing popularity, one of the most popular books on the subject, "Hadoop: The Definitive Guide," by Tom White, an engineer at Cloudera, just came out in its fourth edition.
Enterprise Apps Today caught up with the author to find out what’s new in the latest edition and how Hadoop is evolving.
"In the face of increasing data volumes, Hadoop has proved itself to be the standout platform for general data processing that applies to many use cases," said White. "The book was revised as Hadoop has been moving so fast."
Hadoop began initially as quite a small project within Apache, White noted. Back then, it consisted of the Hadoop Distributed File System (HDFS) with the MapReduce compute engine running on top of it. That was powerful enough to catapult Hadoop into the limelight and for some to even call it the EMC killer, as it utilized commodity hardware in place of proprietary storage arrays.
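The division of labor described above — HDFS holding the data, MapReduce running over it — can be sketched with a toy word count in plain Python. This is only an illustration of the map, shuffle and reduce phases; the function names are invented here, and the real framework distributes each phase across a cluster:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Hadoop stores data", "Hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

On a cluster, each map task runs near the HDFS block holding its slice of input, which is what let commodity hardware stand in for proprietary storage arrays.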
6 Hadoop-related Projects
Hadoop has grown over the last six years into a much larger ecosystem with dozens of projects. White noted some of the more prominent ones:
- Spark. This new processing engine improves upon MapReduce, White said. While MapReduce is reliable and still has its uses, Apache Spark is likely to replace it, at least for new workloads going forward. Spark is faster, has an easier application programming interface (API) to integrate and use, and is compatible with existing data and data formats. "Spark is great for interactive data analysis and looking into the data when you are unsure where you want to go with an analysis," said White.
- Crunch. Apache Crunch runs on top of MapReduce and Spark as an API to facilitate some of the more tedious aspects of working with MapReduce, such as data aggregation, as well as processing of unstructured data.
- Flume. This is a distributed service for collecting and transporting large amounts of data using streaming data flows. It includes fault tolerance, failover and recovery mechanisms. Using Flume, Hadoop can ingest high volumes of data from multiple sources, such as social media, application logs, sensors and geo-location information, into HDFS for analysis.
- Parquet. This is a storage format for Hadoop that enables better data processing. It changes the traditional database approach of storing information in rows to storing data in a columnar pattern. This makes it easier to compress data and speeds up queries.
- YARN. Introduced in Hadoop 2, YARN (Yet Another Resource Negotiator) separates resource management from data processing, making it possible for multiple processing engines to address the same stored data. It adds centralized resource management, more consistent performance, enhanced security and better governance of data across Hadoop clusters.
- Kafka. One of the latest additions, Apache Kafka adds a messaging system to Hadoop that is said to be faster, more reliable and more scalable than anything available previously. This is useful, for example, in the analysis of geospatial data from vehicle fleets or sensor networks as it can deal with massive streams of messages simultaneously.
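The columnar idea behind Parquet, mentioned above, can be sketched in plain Python. This is a toy illustration only — Parquet's actual on-disk format adds encodings, compression and metadata — but it shows why rotating rows into columns helps:

```python
# Row layout: one record per entry, as a traditional database stores it
rows = [
    {"host": "web-1", "status": 200, "bytes": 512},
    {"host": "web-2", "status": 200, "bytes": 1024},
    {"host": "web-1", "status": 404, "bytes": 64},
]

# Columnar layout: one list per field, the pattern Parquet uses on disk
columns = {
    "host":   [r["host"] for r in rows],
    "status": [r["status"] for r in rows],
    "bytes":  [r["bytes"] for r in rows],
}

# A query touching a single field reads only that column, skipping the rest,
# and runs of similar values (e.g. repeated status codes) compress well
total_bytes = sum(columns["bytes"])
print(total_bytes)  # 1600
```

Scanning one column instead of whole rows is what speeds up the analytical queries Hadoop workloads tend to run.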
These are just a few of the projects that have blossomed as part of the Apache/Hadoop ecosystem, said White.
A big change from earlier editions of White's book is the removal of coverage of the initial version of the platform, Hadoop 1. Since Hadoop 2 came out, users have steadily migrated to it. With relatively few now remaining on Hadoop 1, White said it's time to focus solely on Hadoop 2 in his work.
Hadoop as Central Data Hub
Allied Market Research forecasts annual Hadoop growth of around 60 percent. So where is the platform heading as part of its global conquest? White sees it moving up the enterprise food chain as the technology matures.
"Hadoop is increasingly moving into the center of the enterprise to be used as a massive data hub," he said. "You can just dump a ton of data into it and use all of the tools that the ecosystem provides for real-time ingest and analysis or batch processing."
White noted that Hadoop is being harnessed more for real-time or close to real-time ingest and processing workloads. He sees a lot of promise in Kafka, which allows you to publish data into Hadoop and react to it quickly. At the same time, it enables improved long-term archiving of that data in HDFS.
The biggest challenge that lies ahead, White believes, is making it easier to build applications in Hadoop. He thinks this is the area where the platform needs to mature the most if it wants to achieve its full potential.
On the Cloudera side, White said his company has recently been involved in helping to create health care applications that can store and analyze thousands of massive genomics data sets. The company is also building packaging and tools to help operate large clusters. Cloudera Manager, for example, helps upgrade clusters and manage them better.
"Hadoop-based tools for clustering have improved, but it takes a lot of understanding to make them work without Cloudera," he said.
Drew Robb is a freelance writer specializing in technology and engineering. Currently living in Florida, he is originally from Scotland, where he received a degree in geology and geography from the University of Strathclyde. He is the author of Server Disk Management in a Windows Environment (CRC Press).