Which Hadoop Distribution Is Right For You?
Selecting a Hadoop distribution can seem pretty daunting. But it all comes down to which platform best suits your Big Data needs.
By Michele Nemschoff, MapR Technologies
The thought of evaluating whether a particular technology is right for your company can conjure up feelings of stress, anxiety and ambiguity, especially when fielding ROI questions from the CIO and CFO. As more and more companies evaluate Hadoop, they are finding themselves in a similar situation, if not worse. Hadoop has great promise, but it can be a difficult technology to understand and it is moving at a rapid pace.
Assuming Hadoop solves the problem at hand, how do you decide which Hadoop distribution to pick? They all look similar from the outside: each packages more than a dozen open source software components, works on commodity hardware and can run broadly similar sets of analytical workloads. Yet there is a marked difference in terms of what you get for your money. When evaluating Hadoop distributions, here are some questions to ask.
Can It Stand Alone?
At the heart of the matter lies the question of whether you are buying licensed software or buying services for free software. The promise of support services via "hand-holding" and "community-based support" feels invaluable when you start off on an unknown journey with a new technology. But this is a piece of technology that will be going into your production environment, and you should hold it to the same standards as any other technology in your data center. Enterprise-grade products remove the need to rely on third-party support.
Is It Reliable?
A frequently cited weakness of Hadoop is that the NameNode, which holds the file system metadata and keeps track of where every block of data is stored across the cluster, is a single point of failure. In other words, if the NameNode fails, the data on the other nodes becomes inaccessible because it can’t be located without the NameNode, and if the NameNode's metadata is lost, that data is effectively lost too. While Apache Hadoop is addressing this issue in version 2.0 by adding a standby NameNode, some platforms offer alternatives that eliminate the NameNode and its vulnerability.
Planned upgrades to Apache Hadoop also require outages, which can cause contention among departments that have projects to complete and don’t want to wait for the system to update and reboot. Some Hadoop distributions address this problem with rolling upgrades, which update the cluster in stages while it stays in service.
When you begin evaluating Hadoop distributions, make sure you thoroughly understand the reliability and ease-of-use features of each one. Ask a lot of questions, and measure average outage hours, data loss scenarios, administrative overhead, recovery support and integration with the rest of the tools and applications in the enterprise. Once you have those results, talk to reference customers about how knowledgeable and timely they have found each vendor's support.
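One way to compare outage hours across candidates is to convert them into availability percentages. The following is a minimal sketch, not a prescribed method: the distribution names and outage figures are hypothetical placeholders, and you would substitute numbers from your own tests.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def availability_pct(outage_hours_per_year: float) -> float:
    """Return the percentage of the year a cluster was available."""
    if not 0 <= outage_hours_per_year <= HOURS_PER_YEAR:
        raise ValueError("outage hours must be between 0 and 8760")
    return 100.0 * (1 - outage_hours_per_year / HOURS_PER_YEAR)


def rank_by_availability(outages: dict) -> list:
    """Sort (name, availability %) pairs from most to least available."""
    scored = [(name, availability_pct(hours)) for name, hours in outages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical figures for illustration only, not real measurements.
measured_outage_hours = {
    "distribution_a": 8.76,
    "distribution_b": 43.8,
}

for name, pct in rank_by_availability(measured_outage_hours):
    print(f"{name}: {pct:.2f}% available")
```

Availability is only one of the factors listed above; data loss scenarios and administrative overhead would still need to be weighed separately.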
Hadoop and the ROI Question
Be sure to consider ROI when evaluating various distributions. CEOs are going to be critical of investing in any new technology because they know these investments fail as often as they succeed. At the same time, they know that they can’t fall behind their competition and that good investments are key to succeeding in this economy.
Unfortunately, Big Data technology doesn’t allow for a hard ROI model, because companies first have to experiment to learn which questions they are trying to answer and what data is available to them. However, the Hadoop distribution you choose should allow you to construct a plausible path to significant value, along with an outline of a plan for broad adoption if the technology works out. Include in this outline some initial questions that can be answered by the technology, some processes that will be improved, some decisions that data could affect and the suspected business impact.
Is It Manageable?
Related to ROI, you will need to consider the cost of integrating a new system and of training or hiring staff to run it. Consider whether you will need additional expertise to integrate Hadoop’s many software components, or to connect Hadoop to data sources in your existing systems. Some distributions may require custom connectors rather than standard interfaces, and you’ll need to find out if the distribution’s management interface can be operated easily by your IT staff.
Does the Platform Fit Your Needs?
All the major Hadoop distributions offer something a little different. While all vendors offer core Apache Hadoop, some vendors offer additional support, education and consulting services and others include some management tools. In the end, it all comes down to what you need from each distribution, as all vendors offer varying levels of innovation and services on top of the core Hadoop software.
Editor's Note: For more good advice on this topic, see Michele's earlier article on How to Choose a Hadoop Distribution.
Michele Nemschoff is vice president of Corporate Marketing with MapR Technologies, a company that brings dependability, ease-of-use and speed to Hadoop, NoSQL, database and streaming applications in a unified Big Data platform.