Diving into Data Lakes

Drew Robb

Updated · Apr 08, 2015

There is a lot of hype out there about the wonders of data lakes, as well as cautions about the dangers of them turning into data swamps. Much of this debate about the true value of data lakes is premature. After all, we are in the very early stages of the development of this technology.

What we can say, though, is that the potential of the data lake concept fills an existing need.

“With the growing volume, variety and velocity of data, and with so much of that data locked up in application silos or collected from unstructured data sources, organizations are struggling to quickly ingest and use it to make better decisions,” said Ash Parikh, vice president of Product Marketing, Data Integration and Security, at Informatica. “Traditional solutions to these challenges are often expensive, manual and complex, with business analysts at some organizations spending up to 80 percent of their time in preparing data instead of driving new insights. To maximize the potential of Big Data analytics, what is needed is comprehensive Big Data management with data intelligence.”

So that’s the promise, or at least the hope, of the data lake as a means of opening the door to far more comprehensive and accurate analytics. While it is too early in the game to predict which approaches will prove most effective, Enterprise Apps Today has been looking into several of them.

Business Data Lake: Capgemini, Informatica, Pivotal

Capgemini, Informatica and Pivotal are partnering to provide a data lake ecosystem. The Business Data Lake combines Informatica’s data integration software and Pivotal’s platform for Big Data, analytics and applications, as well as Capgemini’s experience in enterprise implementation. Let’s look at some of the underlying elements.

Pivotal’s Big Data suite is said to allow companies to modernize their data infrastructure, utilize analytics capabilities and build analytic applications at scale, and thus to innovate more rapidly by combining agile application development frameworks with advanced analytics.

“Our partnership with Capgemini and Informatica combines the complete data portfolio and data science expertise from Pivotal, business information management expertise and best practices from Capgemini, and Informatica’s leadership in data integration and master data management capabilities,” said Sai Devulapalli, consulting product marketing manager, Data Analytics at Pivotal. “The Business Data Lake enables enterprise customers to leverage a complete product and services offering, enabling them to focus on business problems and use cases as opposed to technology integration and life-cycle management.”

Data lakes, said Devulapalli, leverage technology advancements in collocated compute-plus-storage clusters and the latest in-memory capabilities to address non-traditional data sources such as mobile, Internet of Things, clickstream and social data that don’t necessarily fit into pre-existing data models. By combining batch-mode, interactive and streaming analytics, as well as predictive analytics and low-latency processing, data lakes are said to enable enterprises to make data-driven business decisions.

“Enterprises should view the data lake as a complete platform for analytics and not just as a data storage framework,” said Devulapalli. “As such, data lakes need to support standard data interfaces such as SQL, so enterprises can leverage existing skills and tools to quickly address business use cases.”
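Devulapalli’s point about standard interfaces is easy to illustrate. The sketch below is not part of the Business Data Lake product; the file path, view name and column names are hypothetical. It uses Apache Spark to expose raw clickstream files in a lake to ordinary SQL:

```python
# Minimal sketch: querying raw data-lake files through a standard SQL
# interface with Apache Spark. Path and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-sql-sketch").getOrCreate()

# Register raw JSON click events sitting in HDFS as a queryable view.
events = spark.read.json("hdfs:///lake/raw/clickstream/")
events.createOrReplaceTempView("clickstream")

# Analysts reuse existing SQL skills directly against the lake.
daily = spark.sql("""
    SELECT event_date, COUNT(DISTINCT user_id) AS unique_visitors
    FROM clickstream
    GROUP BY event_date
    ORDER BY event_date
""")
daily.show()
```

Because the interface is plain SQL, existing reporting tools and analyst skills carry over to the lake, which is precisely the safeguard against lock-in that Devulapalli describes.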

Data lakes could have large switching costs due to the use of proprietary interfaces or management and security frameworks, Devulapalli cautioned. To avoid vendor lock-in, he advised users to gravitate toward open platforms with standardized management frameworks when considering data lakes.

Informatica Vibe Data Stream, Informatica Big Data Edition and Informatica Big Data Relationship Manager are the next elements of the Business Data Lake. Vibe Data Stream is said to provide near universal connectivity to data sources, collecting data in flight to enable more timely analytics. Big Data Edition profiles and cleanses data after it has been ingested into Hadoop, while Big Data Relationship Manager discovers relationships on Hadoop to enable a more holistic view of datasets.
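Informatica’s products handle profiling and cleansing internally, but the general pattern is straightforward to sketch. The snippet below is illustrative only, not Informatica’s implementation; the paths and column name are assumptions. It uses PySpark to profile null counts and strip duplicates from newly ingested data on Hadoop:

```python
# Illustrative post-ingestion profiling and cleansing on Hadoop with
# PySpark; not Informatica's implementation. Paths/columns are assumed.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("profile-cleanse-sketch").getOrCreate()
raw = spark.read.parquet("hdfs:///lake/ingested/customers/")

# Profile: count nulls per column to gauge completeness.
raw.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in raw.columns]
).show()

# Cleanse: drop exact duplicates and rows missing the record key.
cleaned = raw.dropDuplicates().filter(F.col("customer_id").isNotNull())
cleaned.write.mode("overwrite").parquet("hdfs:///lake/cleansed/customers/")
```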

“A Hadoop cluster that simply captures raw data that may be incomplete, inconsistent or insecure is of limited value to analytics consumers,” Parikh said.

Data Lake Foundation: EMC

EMC’s Data Lake Foundation is said to be the storage infrastructure that enables an organization to deploy a data lake strategy for its Big Data analytics, bringing data, applications and analytics together regardless of source or destination. Rather than trying to assemble its own analytics technology, EMC sticks to its storage and content management roots and endeavors to make it easier to provide an underlying storage foundation for the data lake.

That foundation is established on a complex series of EMC storage boxes which span just about every style of enterprise storage imaginable. The company’s long experience in storage management and integration comes into play to make these disparate elements work together so that those performing Big Data analytics don’t have to worry about the underlying plumbing.

The goal is to eliminate storage silos, simplify data management, improve utilization, provide massive scalability and operational flexibility, and protect data with backup, disaster recovery and security technologies. What we have, then, is a collection of shared storage that supports the Hadoop Distributed File System (HDFS) and other protocols used in Big Data analytics.
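The practical payoff of multi-protocol shared storage is that a file written by one application over one protocol can be read by another over a different protocol, with no copy in between. A minimal illustration follows; it is not EMC-specific code, and the hostname, mount point and paths are placeholders. It uses pyarrow to read over HDFS what an application wrote through an NFS mount:

```python
# Illustrative: one file on multi-protocol shared storage, reachable both
# as a local/NFS path and over HDFS. Hosts and paths are placeholders.
import pyarrow.fs as pafs
import pyarrow.csv as pacsv

# An application sees the file through an ordinary NFS mount...
with open("/mnt/lake/sales/2015-04.csv") as f:
    print(f.readline())

# ...while a Hadoop analytics job reads the very same bytes over HDFS,
# with no ETL or data migration step in between.
hdfs = pafs.HadoopFileSystem("namenode.example.com", port=8020)
with hdfs.open_input_stream("/lake/sales/2015-04.csv") as stream:
    table = pacsv.read_csv(stream)
print(table.num_rows)
```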

“Most organizations spend a tremendous amount of time, effort and budget moving data from their source systems (the place where data is born) to the analytics and finally to the place where insights are consumed,” said Suresh Sathyamurthy, senior director of Product Marketing, EMC Emerging Technologies Division. “With a single shared storage repository, enterprises can securely store data directly from the source systems, analyze it in place and provide the results through the same system to the point of use while meeting compliance and governance requirements.”

EMC’s bread and butter for decades has been storage hardware and software. Along comes a bunch of wildcats from the open-source crowd proclaiming Hadoop as the ideal home for unstructured data. It’s only natural that the biggest name in storage would object. Its answer is the Data Lake Foundation, with an architecture that offers to store everything and let analytics sit on top of it.

“Businesses can eliminate data consolidation services, ETL processes, data analytics silos like Hadoop, and separate infrastructures for applications like SAS, Splunk or Tableau,” said Sathyamurthy. “All of these applications can work on the same source copy of the data simultaneously.”

The EMC pitch, then, is that Hadoop by itself is not a data lake as it does not support multiple protocols natively, lacks the necessary monitoring and security tools, and requires separate ETL and data migration infrastructures to be established.

“The EMC approach is actually a bit more pragmatic and inclusive than some others as they are viewing data lakes as encompassing not only Big Data for Hadoop analytics, but also such things as videos or images,” said Greg Schulz, an analyst with Server and StorageIO Group.

More Making Data Lake Waves

There are, of course, plenty of others making waves in the data lake. You have the likes of Hortonworks and Splice Machine that see data lakes primarily from a Hadoop standpoint. Such offerings are steadily growing in sophistication.

PricewaterhouseCoopers (PwC), for instance, has taken Hadoop as the main data repository and supplemented it with fine-grained microservices, each associated with a single business function, as well as container technology such as Docker to extend virtualization and make applications portable across clouds.

PwC has been walking the walk with an implementation at UC Irvine Medical Center that maintains millions of records for more than a million patients. The repository, built on a Hadoop architecture, includes radiology images, semi-structured reports, unstructured physicians’ notes and spreadsheet data. To date, it has been used successfully to predict the likelihood of readmissions and take preventive measures to reduce their number.
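Neither PwC nor UC Irvine has published the model itself, but the general shape of a readmission-risk task is easy to sketch. The toy below is purely illustrative, with synthetic data and invented features; it is certainly not the production system:

```python
# Toy readmission-risk sketch; synthetic data and invented features,
# not UC Irvine's or PwC's actual model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Features: age, length of stay (days), prior admissions in past year.
X = np.column_stack([
    rng.integers(20, 90, n),
    rng.integers(1, 30, n),
    rng.integers(0, 10, n),
])
# Label: readmitted within 30 days (synthetic, driven by prior admissions).
y = (X[:, 2] + rng.normal(0, 2, n) > 5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))

# High-risk patients can then be flagged for preventive follow-up.
print("risk for new patient:", model.predict_proba([[72, 14, 6]])[0, 1])
```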

From more of a traditional storage viewpoint, IBM is no doubt cooking up something, although, as usual, it is not quick to join the hype-fest. It already has a wealth of Big Data storage technology as well as an abundance of analytics at its disposal, including IBM Watson, which famously won at Jeopardy. To date, though, Big Blue has announced no tangible data lake plans.

If the data lake concept takes off, it’s a given that the likes of Cisco, HP and Dell will quickly jump into the water as well.

“Then there are the cloud providers such as AWS, Google Cloud Storage (GCS), IBM Softlayer, Microsoft, Rackspace and others that could float their boat in the data lake if they wanted,” said Server and StorageIO Group’s Schulz. “While there is a temptation to associate data lakes as being exclusive to Hadoop Big Data analytics and data scientists due to some industry messaging, the reality is that just like those that are filled with water, there are many different types, shapes and sizes of lakes that are used for and support many different things.”

Perhaps the best way to think of the data lake concept, Schulz added, is as “something that can start as a pool or pond, expanding as it accumulates more unstructured data to evolve into a lake, or perhaps a sea or ocean of data.”

Drew Robb is a freelance writer specializing in technology and engineering. Currently living in Florida, he is originally from Scotland, where he received a degree in geology and geography from the University of Strathclyde. He is the author of Server Disk Management in a Windows Environment (CRC Press).
