Cloudera Accelerates Big Data with Impala GA

Sean Michael Kerner

Updated · May 01, 2013

Six months ago, enterprise Big Data vendor Cloudera announced the open source Impala project. The goal of Impala is to bring a real-time query engine to Hadoop Big Data.

This week Cloudera is officially declaring Impala to be generally available and ready for deployment.

“Impala is a parallel database query engine that runs natively on top of the Hadoop platform,” Charles Zedlewski, VP of Product at Cloudera, explained to Enterprise Apps Today. “It runs natively on Hadoop storage, with the same schema, Hive metastore and file formats.”

Zedlewski stressed that Impala can be deployed in an existing Hadoop cluster without the need for any architectural changes. In that way, Impala can be used as an overlay for existing data sets.

“It is SQL and the first word in SQL is structured, so I don’t want to pretend that you can point Impala at a bunch of video files for a query,” Zedlewski said. “But Impala does not require that you have to massage your structured data into a special format.”

Impala started as an open source project under the Apache license, and it remains so today. Cloudera has added a commercially supported offering called RTQ, which provides additional monitoring and management interfaces.

“RTQ is what customers pay us for if they want support, but anyone can use Impala for free,” Zedlewski said. “With RTQ you also do get management automation.”

Early Hadoop deployments were used mainly for batch processing, partly because of the latency overhead that Hadoop's MapReduce framework introduces.

“There are a lot of workloads that customers have where they want responses in under five minutes,” Zedlewski said. “Impala fits nicely into that use case.”

The speed Impala provides also enables interactive business intelligence (BI) workloads on Hadoop Big Data, he added.

Importance of In Memory

When it comes to Big Data query speed, the ability to enable in-memory analytics and functionality is critical. Impala is already capable of doing a number of its steps in memory. Zedlewski explained that all of the query joins are broadcast in memory, and it’s one of the sources of Impala’s performance.
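To make the idea of an in-memory broadcast join concrete, here is a minimal sketch in Python of the general technique. It is not Impala's code: the function name, the sample tables and the use of plain lists to stand in for cluster nodes are all assumptions made for illustration. The smaller table is indexed in memory once, and every partition of the larger table is joined against that same copy, so the large table never has to move.

```python
from collections import defaultdict


def broadcast_hash_join(small_table, large_partitions, key):
    """Toy broadcast hash join (hypothetical helper, not an Impala API)."""
    # Build phase: index the small table by join key, entirely in memory.
    index = defaultdict(list)
    for row in small_table:
        index[row[key]].append(row)

    # Probe phase: each partition stands in for one node's local data.
    # Every partition probes the same in-memory index.
    joined = []
    for partition in large_partitions:
        for row in partition:
            for match in index.get(row[key], []):
                joined.append({**match, **row})
    return joined


# Illustrative data only.
regions = [
    {"region_id": 1, "region": "East"},
    {"region_id": 2, "region": "West"},
]
sales_partitions = [
    [{"region_id": 1, "amount": 10}, {"region_id": 2, "amount": 5}],
    [{"region_id": 1, "amount": 7}, {"region_id": 3, "amount": 99}],
]

result = broadcast_hash_join(regions, sales_partitions, "region_id")
for row in result:
    print(row)
```

Rows whose key has no match in the small table, such as region_id 3 here, are dropped, which is inner-join behavior. A real engine would also have to handle small tables that do not fit in memory, which this sketch ignores.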

“Impala also makes use of the existing cache that is resident in a Hadoop cluster,” Zedlewski said.

Now that Impala is generally available, the project will continue to evolve, though not necessarily at the same rate as Hadoop itself. The Apache Hadoop project includes over 13 different projects, each with a different release timeline. The open source Cloudera Distribution for Hadoop (CDH) is all about providing a form of synchronized milestone releases of Hadoop.

“Since Impala is advancing so quickly, we do plan on doing additional updates to Impala off-cycle so we can rev it faster,” Zedlewski said. “With our core Hadoop release, people rely on it for some pretty important stuff now, so they expect a bit of a slower release cycle.”

For now Impala will keep to its own release cycle, he said, although at some point it will sync up with the broader Cloudera release cycle.

Sean Michael Kerner is a senior editor at Enterprise Apps Today and InternetNews.com. Follow him on Twitter @TechJournalist.


