Hadoop Evolution: What You Need to Know
Updated · May 23, 2016
It’s been a decade since Hadoop became an Apache software project and released version 0.1.0. The open source project helped launch the Big Data era, created a foundation for most of the big cloud platform providers and changed how enterprises think about data.
Despite Hadoop’s rapid evolution from a side project inspired by Google’s published research to a technology stack with major distributions and cloud providers, many enterprises still find Hadoop difficult, experts say. Rather than becoming simpler and easier, Hadoop spawned an entire ecosystem of open source tools and technologies, including Mesos, Spark, Hive, Kafka, ZooKeeper, Phoenix, Oozie and HBase, all tied directly or indirectly to Hadoop.
In this article, we discuss:
- Who is driving Hadoop adoption in the enterprise
- How early design decisions hampered Hadoop
- How its open source licensing model affects Hadoop
- Why companies are using the cloud and platform-as-a-service (PaaS) with Hadoop
- How and why companies are moving away from huge Hadoop data clusters
Making Sense of Hadoop
How can enterprise executives make sense of this sprawling Hadoop ecosystem?
“It’s a struggle,” acknowledged Nick Heudecker, who researches data management for Gartner’s IT Leaders (ITL) Data and Analytics group. “Hadoop doesn’t typically get better by improving the things that are already there; it gets better by adding new stuff on top of it, and that consequently makes things much more complicated.”
Hadoop Adoption: Backed by the Business
Even trying to assess Hadoop adoption is more complicated than it should be. Last year, Gartner surveyed 284 of its Gartner Research Circle members and found enterprise Hadoop adoption was falling short of expectations, especially given its hype. Fifty-four percent of survey respondents had no plans to invest in Hadoop, and just 18 percent had plans to invest over the next two years. What’s more, Heudecker noted, early adopters didn’t appear to be championing further Hadoop usage.
A TDWI survey of 247 IT professionals published at about the same time pointed to a conflicting conclusion: Many enterprises (46 percent) were already using Hadoop to complement or extend a traditional data warehouse, and 39 percent were shifting their data staging or data landing workloads to Hadoop. Other surveys, such as one from AtScale, also found strong adoption.
Philip Russom, research director for data management with TDWI Research, consulted with Gartner’s Merv Adrian about the discrepancy and discovered something surprising. Gartner had primarily talked to CIOs and other C-level executives while TDWI primarily consulted with data management professionals.
“Long story short, Hadoop is not being adopted as a shared resource, owned and operated by central IT,” Russom said via email. “However, it is being adopted briskly ‘down org chart’ as a Big Data platform and analytics processing platform for specific applications in data warehousing, data integration and analytics. And those applications are sponsored, funded and used by departments and business units – not central IT.”
Heudecker said that still matches what Gartner’s seeing. It may also help explain why enterprises seem to be so iffy about Hadoop: Despite Hadoop’s technical learning curve, business units seem to be dabbling in it more than central IT.
“It’s very rare to see enterprisewide deployments that are run as a Hadoop center of excellence, for instance,” Heudecker said. “It’s hard to really pin down one reason why that’s happening.”
One reason may simply be that business units control a growing portion of the technology spend, he said. Business users want self-service data, which can mean everything from self-service data preparation to self-service integration and analytics. It’s also creating a demand for accessing Hadoop through existing business intelligence or analytic tools, but those tools still need to improve, he cautioned.
Hadoop’s Persistent Problem
Hadoop has been limited by its own design as well as recent changes in the technology world.
Hadoop and its first processing engine, MapReduce, were developed as tools for technology’s elite data analysts, and not much changed on the way to commercial distribution. In many ways, the open source technology stack has been its own worst enemy, from MapReduce’s disk-centric approach and demand for specialist programming skills down to Hadoop’s batch-oriented approach.
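To make the programming model concrete, here is a minimal single-process sketch in Python of the map, shuffle and reduce steps behind a word count. It is an illustration only, not Hadoop’s API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and a real Hadoop job would distribute each phase across a cluster and write intermediate results to disk.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    """Run the whole batch: nothing is available until every phase has finished."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

Even in this toy form, the answer only appears after the full input has passed through all three phases, which hints at why a batch-oriented design is a poor fit for real-time questions.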
“That’s the big limitation with Hadoop; it’s a batch-oriented data layer and, as companies start to get more serious about Hadoop, they’re moving into ‘how do I get real-time, how do I start impacting the business,’” said Jack Norris, senior vice president of Data & Applications at MapR, a Hadoop-derived startup. “To do that with Hadoop at the center, you’ve got to do a lot of things to try to make up for the fact that it’s got a weak underlying data layer.”
MapR avoided the problem by rewriting that data layer rather than using the Apache Hadoop distributed file system, Norris added.
MapReduce and Hadoop were also originally designed to run on clusters of commodity hardware back when memory was very expensive, Heudecker pointed out. That need has diminished as in-memory processing has become cheaper.
That’s where Spark shines, since it uses in-memory processing, which is faster than a disk-centric approach. Spark is getting love from companies ranging from IBM, which has opened a Spark technology center and introduced a number of Spark-centric solutions, to Cloudera, which made Spark a focal point of its latest release. Proprietary appliances that leverage in-memory processing have also come to market, which further skewed the market for Hadoop.
But no matter how you mix up the ecosystem, these open source tools still aren’t easy. That is Hadoop’s most persistent problem: It requires skills that even large enterprises struggle to hire.
The Open Source Conundrum
Hadoop’s open source licensing model also played an unintentional role in driving complexity, Heudecker said.
“Open source has been effectively weaponized by these vendors so everyone has a vested interest in X project versus Y project, depending on where you have allocated your committers that work for your company,” he said. “Open source is phenomenal; it really is. It has completely changed the game for how enterprises look at acquiring software, but it’s not this altruistic effort any more. There’s big money in open source software. So you’ll see some companies supporting project X over project Y because that’s what they ship.”
The open source community may also be more focused on developing the technology than on supporting data management best practices. Many Hadoop data lakes either don’t support or offer inadequate support for audit trails, data integrity, data quality, encryption or data governance, Russom said.
“It’s not all rainbows and unicorns,” Russom wrote. “I don’t see the open source community caring much about these issues in a Hadoop environment.”
Hadoop, Amazon and New Tools
Despite Hadoop’s limitations and scattered evolution, experts say it’s not going away.
That may be why more companies are looking to the cloud to handle Hadoop. Gartner estimates that Amazon has more than twice as many users of EMR, its Hadoop service, as all of the startup Hadoop distributors combined. Cloud allows companies to separate compute from storage, so they can spin up more clusters as needed, then tear them down rather than maintaining them simply to store the data.
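A rough way to see the appeal of separating compute from storage is to count node-hours. The Python sketch below compares a cluster that stays up all month with clusters that exist only while jobs run. The function names and every figure in it (10 nodes, a 720-hour month, 30 jobs of 2 hours each) are hypothetical values chosen for illustration, not data from Gartner, Amazon or any vendor, and the sketch ignores storage and data transfer costs.

```python
def node_hours_always_on(nodes, hours_in_period):
    """Compute consumed by a cluster that runs for the whole period."""
    return nodes * hours_in_period


def node_hours_transient(nodes, hours_per_job, jobs):
    """Compute consumed when a cluster exists only for the duration of each job."""
    return nodes * hours_per_job * jobs


if __name__ == "__main__":
    # Hypothetical workload: 10 nodes, a 720-hour month, 30 two-hour jobs.
    always_on = node_hours_always_on(nodes=10, hours_in_period=720)
    transient = node_hours_transient(nodes=10, hours_per_job=2, jobs=30)
    print(f"always-on: {always_on} node-hours, transient: {transient} node-hours")
```

Under these assumed numbers, the always-on cluster spends most of its node-hours sitting idle next to the data, which is the pattern that spin-up, tear-down clusters are meant to avoid.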
Other vendors are also introducing new tools to help close the data capabilities gap, Russom pointed out.