9 Useful Open Source Big Data Tools
Hadoop is not the be-all and end-all of Big Data. There are lots of other Big Data platforms and tools, many of which are open source.
It's impossible to talk about Big Data without mentioning Apache Hadoop. But Hadoop is just a part of a thriving Big Data software ecosystem. There are plenty of other Big Data platforms and tools, and many of them are open source.
Why are so many Big Data projects open source? There's no definitive answer, but most likely it's related to the fact that Hadoop is the project that got the Big Data bandwagon rolling. Since Hadoop is open source, many folks who work with it are active in the open source community. That means the tools they develop are also likely to be open source.
The rapid adoption of many Big Data projects is due in part to the fact that the software required is open source and can be downloaded and adopted at a departmental or even employee level before being embraced by the IT department.
Whatever the reason, the benefits to organizations are significant: Big Data software tools are freely available, and instead of paying for licenses, companies can pay to have the open source code customized to their exact requirements if necessary. (Many open source tools are also offered on a commercial basis, with support for organizations that want to adopt them but lack the expertise to work with the source code unaided.)
The range of open source tools now available can be bewildering. Here we look at two of the hottest and most innovative areas: Big Data platforms themselves and Big Data search.
Big Data Platforms
The seven Big Data platforms covered here are:
- Lumify
- Talend Open Studio for Big Data
- HPCC Systems Big Data
- Apache Storm
- Apache Drill
- Apache Samoa
- Ikanow
Lumify is a relatively new open source project to create a Big Data fusion, analysis and visualization platform. Its Web-based interface allows you to discover connections and explore relationships in your data via a suite of analytic options, including 2D and 3D graph visualizations, full-text faceted search, dynamic histograms, interactive geographic maps and collaborative workspaces.
Talend Open Studio for Big Data lets you work with Hadoop and NoSQL databases. It provides simple graphical tools and wizards to generate native code that helps you leverage the full power of Hadoop.
HPCC Systems Big Data is an alternative to Hadoop: a platform for manipulating, transforming, querying and warehousing your Big Data. It uses the Thor data refinery and the Roxie data query/delivery engine, with Enterprise Control Language (ECL) as an alternative to Apache Pig. (ECL is claimed to be 4.45 times faster than Pig on average.)
The Community Edition is a free version of the HPCC Systems platform and is supported by an active community of developers and enthusiasts via online discussion forums.
Apache Storm is a distributed real-time computation system that allows you to process unbounded streams of data reliably. It does for real-time processing what Hadoop does for batch processing. You can use the software with any programming language.
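To make the idea of processing an unbounded stream concrete, here is a minimal sketch in plain Python. It does not use Storm's API; Storm calls its stream sources "spouts" and its processing steps "bolts", and the generator functions below (sentence_source, split_words, running_counts) are hypothetical stand-ins for those roles, running in a single process with none of Storm's distribution or reliability guarantees.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, Tuple


def sentence_source() -> Iterator[str]:
    """Stand-in for a stream source: yields sentences forever."""
    sentences = ["big data tools", "open source tools", "big data search"]
    while True:
        for sentence in sentences:
            yield sentence


def split_words(sentences: Iterable[str]) -> Iterator[str]:
    """Processing step 1: split each incoming sentence into words."""
    for sentence in sentences:
        yield from sentence.split()


def running_counts(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Processing step 2: emit an updated count each time a word arrives."""
    counts: Counter = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


def build_pipeline() -> Iterator[Tuple[str, int]]:
    """Wire the source and the two steps into one lazy pipeline."""
    return running_counts(split_words(sentence_source()))


if __name__ == "__main__":
    # The stream never ends, so take only the first nine results.
    for word, count in islice(build_pipeline(), 9):
        print(word, count)
```

The point of the sketch is that results are emitted as each item arrives, rather than after a complete data set has been read, which is the contrast with batch processing drawn above.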
Apache Drill is a SQL query engine for Big Data exploration. It has been designed from the ground up to support high-performance analysis on your semi-structured and rapidly evolving data coming from modern Big Data applications. Drill provides plug-and-play integration with your existing Apache Hive and Apache HBase deployments.
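As a rough illustration of what querying semi-structured data with SQL looks like, the sketch below prepares a request for Drill's REST query endpoint using only the Python standard library. The host and port (localhost:8047), the file path /data/events.json and the field names user_id and device.os are assumptions made for the example; the request is built but deliberately not sent, so no running Drill instance is needed to try it.

```python
import json
import urllib.request

# Hypothetical query: read a nested field straight from a JSON file,
# with no schema declared up front. Path and field names are assumptions.
EXAMPLE_SQL = (
    "SELECT t.user_id, t.device.os AS os "
    "FROM dfs.`/data/events.json` t LIMIT 10"
)


def build_drill_request(
    sql: str, base_url: str = "http://localhost:8047"
) -> urllib.request.Request:
    """Build (but do not send) a POST request for Drill's /query.json endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_drill_request(EXAMPLE_SQL)
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would submit it to a Drill instance listening at that address.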
Apache Samoa (Scalable Advanced Massive Online Analysis) is a platform for mining your Big Data streams. It is a distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms.
This enables you to develop new ML algorithms without directly dealing with the complexity of underlying distributed stream processing engines (DSPEs), such as Apache Storm, Apache S4 and Apache Samza.
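For readers new to streaming ML, the following single-machine sketch shows the basic pattern such algorithms follow: the model is updated one example at a time and is scored on each example before learning from it (often called test-then-train, or prequential, evaluation). It is not Samoa's API. MajorityClassLearner and prequential_accuracy are hypothetical names, and the learner is a deliberately trivial baseline that ignores features and predicts the most frequent label seen so far.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional


class MajorityClassLearner:
    """Toy online learner: predicts the most frequent label seen so far."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def predict(self) -> Optional[Hashable]:
        # Before any training there is nothing to predict.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, label: Hashable) -> None:
        # Constant-time update; the stream is never stored.
        self.counts[label] += 1


def prequential_accuracy(labels: Iterable[Hashable]) -> float:
    """Test on each label first, then train on it; return the hit rate."""
    learner = MajorityClassLearner()
    correct = 0
    total = 0
    for label in labels:
        if learner.predict() == label:
            correct += 1
        learner.learn(label)
        total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    print(prequential_accuracy(["a", "a", "b", "a"]))
```

A framework of the kind described above takes care of running this loop across a cluster, so that the algorithm author writes only the predict and learn logic.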
Ikanow is something slightly different: It claims to be the world's first unstructured security analytics platform. The free Community Edition lets you tap into unstructured and structured data and delivers ingest, search, data widgets and export features in an open, self-supported platform.
Specialist Big Data Search Tools
Apache Solr is an open source search platform designed to be highly reliable, scalable and fault tolerant. It provides distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and other features.
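Solr is queried over HTTP, so a search can be expressed as a URL. The sketch below builds one for Solr's select handler using standard query parameters (q, fq, rows, start, wt). The base URL, the collection name "articles" and the field names title and year are assumptions for illustration, and the helper build_solr_select_url is hypothetical; it only constructs the URL and does not contact a server.

```python
from typing import Iterable
from urllib.parse import parse_qs, urlencode, urlsplit


def build_solr_select_url(
    base_url: str,
    collection: str,
    query: str,
    rows: int = 10,
    start: int = 0,
    filters: Iterable[str] = (),
) -> str:
    """Build a URL for the /select handler of a Solr collection."""
    params = [("q", query), ("rows", rows), ("start", start), ("wt", "json")]
    # Each filter query narrows the result set without affecting scoring.
    params.extend(("fq", fq) for fq in filters)
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    url = build_solr_select_url(
        "http://localhost:8983",
        "articles",
        "title:hadoop",
        rows=5,
        filters=["year:[2014 TO *]"],
    )
    print(url)
```

Fetching that URL from a running Solr instance would return the matching documents as JSON.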
Elasticsearch is a distributed, open source search and analytics engine, designed for horizontal scalability, reliability and easy management. It combines the speed of search with the power of analytics via a query language that has been designed to be developer-friendly, covering structured, unstructured and time-series data.
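The query language mentioned here is JSON-based, which makes it easy to sketch how search and analytics combine in a single request. The example below builds a request body that pairs a full-text match with a time-range filter and a terms aggregation. The field names message, @timestamp and host.keyword are assumptions about a hypothetical log index, and build_search_body is an illustrative helper, not part of any Elasticsearch client library.

```python
import json
from typing import Any, Dict


def build_search_body(text: str, since: str = "now-1h", size: int = 10) -> Dict[str, Any]:
    """Build a search body: full-text match, time filter, per-host counts."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text search over unstructured log messages.
                "must": [{"match": {"message": text}}],
                # Restrict to recent documents (time-series data).
                "filter": [{"range": {"@timestamp": {"gte": since}}}],
            }
        },
        # Analytics: count matching documents per host (structured data).
        "aggs": {"by_host": {"terms": {"field": "host.keyword"}}},
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body("timeout"), indent=2))
```

A body like this would be sent as JSON to the _search endpoint of an index.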
Paul Rubens has been covering enterprise technology for over 20 years. In that time he has written for leading UK and international publications including The Economist, The Times, Financial Times, the BBC, Computing and ServerWatch.