How to Choose a Hadoop Distribution
Choosing which of the three types of Hadoop distributions will work best for your organization largely comes down to where you are in your Big Data journey.
By Michele Nemschoff, MapR Technologies
Apache Hadoop has gotten a lot of press, but choosing the right Hadoop distribution for your company is easier said than done. The projects related to Hadoop, such as Pig, Hive and HBase, are independent, which means they all have to be installed and integrated manually. Add to that the fact that you will have to reconcile and update different versions on your own because there is no commercial support, and Apache Hadoop could require more work than you might expect.
Thankfully, companies have started making their own distributions of Hadoop that have tested and hardened the open-source version, and that provide support and help simplify the installation process. Commercial distributions for Hadoop assemble the various enhancement projects from the Apache repository and present them in a unified product so businesses don’t have to embark on a science project of assembling each of these elements into a functional whole.
Vendors of Hadoop distributions vary in what they offer. Some offer the basic open-source software with support, consulting and education services. Others also offer additional innovations to ease the development, administration and operation of Hadoop.
The distributions offered can be divided into three groups based on the types of services and innovations they provide:
Core Hadoop Distribution
Some distributions offer a service-only model of Hadoop. These platforms strive to stay as close to the original open-source structure of Hadoop as possible. Generally, the only enhancements available with these distributions are those created and offered by the Hadoop community, but the vendors will usually offer support for using the product.
The benefits of using a product that is just the core open-source distribution are that it’s usually free and that enhancements by the open-source community are passed on to users. The disadvantage of this choice is that it does not include extra enhancements to make it user-friendly or enterprise-grade.
A second class of vendors provides an additional layer of management software that helps administrators configure, monitor and tune Hadoop, reducing the level of expertise required to manage it. These vendors also offer training and consulting services to help companies integrate Hadoop into their data management strategies. The biggest advantage is the enhanced management and monitoring capabilities.
Enterprise Reliability and Integration
A third class of Hadoop vendors offers a more robust solution: a management layer augmented with connectivity to existing enterprise systems and engineered to provide the same high level of availability, scalability and reliability as other enterprise systems. Features such as NFS support and snapshot-based data protection, which offers point-in-time recovery of files and tables to guard against user or application error, transform Hadoop from a relatively young open-source platform into an enterprise-ready Big Data platform.
Of course, these distributions incorporate community innovations as well as the vendor’s own.
Choosing the Right Hadoop Distro
Ultimately, the type of distribution you choose will depend on the needs of your company. Are you just looking to experiment with Hadoop to see whether it is a technology worth investing in? A core distribution with basic support may be all you need.
On the other hand, will Hadoop become an important part of your data strategy? Will Hadoop contribute to the ROI of multiple departments within the company? In that case, an enterprise-grade Hadoop distribution will likely better suit your needs, as it will allow you to start getting results from Big Data faster.
Michele Nemschoff is vice president of Corporate Marketing with MapR Technologies, a company that brings dependability, ease-of-use and speed to Hadoop, NoSQL, database and streaming applications in a unified Big Data platform.