9 Useful Open Source Big Data Tools

Paul Ferrill

Updated · Nov 11, 2015

It’s impossible to talk about Big Data without mentioning Apache Hadoop. But Hadoop is just a part of a thriving Big Data software ecosystem. There are plenty of other Big Data platforms and tools, and many of them are open source.

Why are so many Big Data projects open source? There’s no definitive answer, but most likely it’s related to the fact that Hadoop is the project that got the Big Data bandwagon rolling. Since Hadoop is open source, many folks who work with it are active in the open source community. That means the tools they develop are also likely to be open source.

The rapid adoption of many Big Data projects is due in part to the fact that the software required is open source and can be downloaded and adopted at a departmental or even employee level before being embraced by the IT department.

Whatever the reason, the benefits to organizations are significant: Big Data software tools are freely available, and instead of paying for licenses, companies can pay to have the open source code customized to their exact requirements if necessary. (Many open source tools are also offered on a commercial basis, with support for organizations that want to adopt them but lack the expertise to work with the source code unaided.)

The range of open source tools now available can be bewildering. Here we look at two of the hottest and most innovative areas: Big Data platforms themselves and Big Data search.

Big Data Platforms

Seven Big Data platforms covered here:

  • Lumify
  • Talend Open Studio for Big Data
  • HPCC Systems Big Data
  • Apache Storm
  • Apache Drill
  • Apache Samoa
  • Ikanow

Lumify is a relatively new open source project to create a Big Data fusion, analysis and visualization platform. Its Web-based interface allows you to discover connections and explore relationships in your data via a suite of analytic options, including 2D and 3D graph visualizations, full-text faceted search, dynamic histograms, interactive geographic maps and collaborative workspaces.

Talend Open Studio for Big Data lets you work with Hadoop and NoSQL databases. It provides simple graphical tools and wizards to generate native code that helps you leverage the full power of Hadoop.

HPCC Systems Big Data is a platform for manipulating, transforming, querying and warehousing your Big Data, and is an alternative to Hadoop. It consists of the Thor data refinery, the Roxie data query and delivery engine, and Enterprise Control Language (ECL), which serves as its alternative to Apache Pig. (ECL is claimed to be 4.45 times faster than Pig on average.)

The Community Edition is a free version of the HPCC Systems platform and is supported by an active community of developers and enthusiasts via online discussion forums.

Apache Storm is a distributed real-time computation system that allows you to process unbounded streams of data reliably. It does for real-time processing what Hadoop does for batch processing. You can use the software with any programming language.
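To make the idea of processing an unbounded stream concrete, here is a minimal sketch in plain Python. It is not the Storm API and involves no cluster; it only borrows Storm's terms for a stream source (a "spout") and a processing step (a "bolt"). All function names are hypothetical and chosen for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def sentence_spout(lines: Iterable[str]) -> Iterator[str]:
    """Source stage: emits one sentence at a time (the input could be endless)."""
    for line in lines:
        yield line


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Processing stage: splits each sentence into lowercase words."""
    for sentence in sentences:
        for word in sentence.lower().split():
            yield word


def count_bolt(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Processing stage: emits the running count for each word as it arrives."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


def run_topology(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Wire the stages together; nothing runs until results are consumed."""
    return count_bolt(split_bolt(sentence_spout(lines)))


if __name__ == "__main__":
    for word, count in run_topology(["the cat", "the dog"]):
        print(word, count)
```

Because each stage is a generator, results are produced one item at a time as input arrives, rather than after a complete batch has been collected. That is the contrast with batch processing that the description above draws.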

Apache Drill is a SQL query engine for Big Data exploration. It has been designed from the ground up to support high-performance analysis on your semi-structured and rapidly evolving data coming from modern Big Data applications. Drill provides plug-and-play integration with your existing Apache Hive and Apache HBase deployments.
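As a rough illustration of what SQL over semi-structured data can look like, the sketch below prepares a query for Drill's REST interface using only the Python standard library. It assumes a Drill instance running locally with its web interface on port 8047. The file path and field names in the SQL are hypothetical, and the response handling assumes the result rows arrive under a "rows" key.

```python
import json
import urllib.request

# Assumption: a local Drill instance exposing its REST interface on port 8047.
DRILL_URL = "http://localhost:8047/query.json"

# Hypothetical JSON file and field names, for illustration only.
# Nested fields are reached with dotted paths on a table alias.
EXAMPLE_SQL = (
    "SELECT t.customer.city AS city, t.event_type "
    "FROM dfs.`/data/events.json` t "
    "WHERE t.event_type = 'login' LIMIT 10"
)


def build_query_payload(sql: str) -> bytes:
    """Encode a SQL statement as the JSON request body for the query endpoint."""
    return json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")


def run_query(sql: str, url: str = DRILL_URL) -> list:
    """Send the query and return the result rows (requires a running Drill)."""
    request = urllib.request.Request(
        url,
        data=build_query_payload(sql),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("rows", [])
```

Calling run_query(EXAMPLE_SQL) would only succeed against a live Drill instance that can read the hypothetical file; build_query_payload can be inspected on its own without one.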

Apache Samoa (Scalable Advanced Massive Online Analysis) is a platform for mining your Big Data streams. It is a distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms.

This enables you to develop new ML algorithms without directly dealing with the complexity of underlying distributed stream processing engines (DSPEs), such as Apache Storm, Apache S4 and Apache Samza.

Ikanow is something slightly different: It claims to be the world’s first unstructured security analytics platform. The free Community Edition lets you tap into unstructured and structured data and delivers ingest, search, data widgets and export features in an open, self-supported platform.

Specialist Big Data Search Tools

Apache Solr is designed to be highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and other features.

Solr powers the search and navigation features of many of the world’s largest Internet sites, and is built on Apache Lucene’s Java-based indexing and search technology.

Elasticsearch is a distributed, open source search and analytics engine, designed for horizontal scalability, reliability and easy management. It combines the speed of search with the power of analytics via a query language that has been designed to be developer-friendly, covering structured, unstructured and time-series data.
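To give a feel for how search and analytics can be combined in one request, here is a minimal sketch that builds an Elasticsearch-style JSON query body in Python. It pairs a full-text "match" query with a "terms" aggregation that counts the matching documents per value of a second field. The field names ("message", "host") and the aggregation label ("by_group") are hypothetical, and no request is actually sent; a body like this would normally be posted to an index's _search endpoint.

```python
import json


def build_search_body(text: str, text_field: str, group_field: str, size: int = 10) -> dict:
    """Build a request body with a full-text query and a per-value count."""
    return {
        "size": size,  # maximum number of matching documents to return
        "query": {"match": {text_field: text}},  # the search part
        "aggs": {"by_group": {"terms": {"field": group_field}}},  # the analytics part
    }


# Hypothetical field names, for illustration only.
body = build_search_body("timeout error", "message", "host")

if __name__ == "__main__":
    print(json.dumps(body, indent=2))
```

Keeping the body as a plain dictionary makes it easy to inspect or adjust before it is serialized and sent.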

Paul Rubens has been covering enterprise technology for over 20 years. In that time he has written for leading UK and international publications including The Economist, The Times, Financial Times, the BBC, Computing and ServerWatch.

Paul Ferrill

Paul Ferrill has been writing for over 15 years about computers and network technology. He holds a BS in Electrical Engineering as well as an MS in Electrical Engineering. He is a regular contributor to the computer trade press and specializes in complex data analysis and storage. He has written hundreds of articles and two books for various outlets over the years. His articles have appeared in Enterprise Apps Today, InfoWorld, Network World, PC Magazine, Forbes, and many other publications.
