Data Virtualization and Big Data Business Intelligence
Data virtualization could be the glue that ties together data warehouses and social media data.
When I set out to write this article, I thought I could assume a basic understanding of data virtualization (DV) among readers and focus on the neat new benefits to agile business intelligence from using DV to combine Big Data with the data warehouse dynamically. Then I found, reading what's out there on the 'Net, that there's still a lack of understanding of what data virtualization really is and what it does. For example, as recently as a couple of months ago, Wikipedia was unsure of the relationship between data virtualization and "Enterprise Information Integration" (EII), which it said "failed in the market."
So before I note data virtualization's upcoming benefits, I am going to ask the reader to review the definition, history and existing benefits of DV, briefly, with me. I promise: the review will help.
Data, data everywhere
The basic aim of data virtualization, whose first solutions arrived around 2001, is to allow users to query across differing data sources in real time. That means that any DV solution needs three brand spanking new (in 2001) technologies:
1. A way to gather data in real time from any data store, whether it is accessed through different vendors' databases, file systems, or applications;
2. A global metadata repository that shows not only what data is out there, anywhere, but also the relationships between the data in various data stores;
3. A common format across any and all data types, so that when DV combines the data it can present to the end user the relationships between the data, not just differing formats side by side.
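To make the three pieces concrete, here is a minimal sketch in Python, not a description of any vendor's product. It assumes two toy sources (an in-memory SQLite table and a CSV string); the names read_customers, read_orders, METADATA, and federated_join are all hypothetical. Each adapter fetches data at query time, the metadata dictionary records which sources exist and how they relate, and plain dictionaries serve as the common format.

```python
import csv
import io
import sqlite3

# Source 1: a relational store (stand-in for one vendor's database).
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (cust_id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])

# Source 2: a flat file (stand-in for a file-system source).
ORDERS_CSV = "customer,amount\n1,250.0\n2,75.5\n1,100.0\n"


def read_customers():
    """Adapter: query the database when asked; return common-format dicts."""
    rows = crm.execute("SELECT cust_id, name FROM customers").fetchall()
    return [{"cust_id": cust_id, "name": name} for cust_id, name in rows]


def read_orders():
    """Adapter: parse the file when asked; return common-format dicts."""
    reader = csv.DictReader(io.StringIO(ORDERS_CSV))
    return [{"customer": int(r["customer"]), "amount": float(r["amount"])}
            for r in reader]


# Hypothetical global metadata repository: which sources exist, and how
# a key in one source relates to a key in another.
METADATA = {
    "sources": {"customers": read_customers, "orders": read_orders},
    "relationships": [("customers", "cust_id", "orders", "customer")],
}


def federated_join(left, right):
    """Join two sources using the relationship recorded in METADATA."""
    for l_name, l_key, r_name, r_key in METADATA["relationships"]:
        if (l_name, r_name) == (left, right):
            break
    else:
        raise KeyError(f"no known relationship between {left} and {right}")

    # Index the left source by its key, then match each right-hand row.
    index = {}
    for row in METADATA["sources"][left]():
        index.setdefault(row[l_key], []).append(row)
    return [{**l_row, **r_row}
            for r_row in METADATA["sources"][right]()
            for l_row in index.get(r_row[r_key], [])]


result = federated_join("customers", "orders")
for row in result:
    print(row["name"], row["amount"])
```

Nothing is copied into a shared store here: each call to federated_join goes back to the sources, which is the property that separates this approach from loading everything into a warehouse first.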
That has been the core of DV's value proposition, but it is not the whole story, because data virtualization by definition aims to be (to stretch a much-abused word) "agile." That is, its ongoing value lies in its ability to keep pace with the proliferating number of data types out there in the world. A data warehouse, or a file system, achieves performance above all by refining its ability to process a particular type of data. DV piggybacks on this performance, but focuses its own performance improvements on combining data types where the existing database has not done so. Over time, the "rich have become richer": the gap between what's stored in a data warehouse and a zettabyte's worth of a wide array of other data types stored all over the world has become ever wider, and DV continues to bridge that gap and assemble a richer and richer set of combined data and metadata.
But that is by no means the only value DV has delivered since it arrived, because it turns out that data virtualization is a superb "Swiss army knife." You can take any of the three technologies I cited above and use it for other purposes as well, simultaneously. You can combine DV as a whole with other infrastructure software, especially data management software, and create a full enterprise database architecture or global data architecture that makes everything look like it's in one consistent, real-time-data-available "virtual" database, complete with common XQuery data access. Here are a few more cute things you can do, with some tweaking:
· Data discovery – discover all the data you didn't know you had in your enterprise, and store their relationships in your very own global metadata repository;
· Master data management (MDM) – store at least some of the combined common-format data in a permanent data store;
· Real-time beyond-warehouse business intelligence – combine queries of data types not in the data warehouse, or stuff not yet in the data warehouse, with data warehouse queries;
· Merge corporate data as you merge corporations, immediately, without having to try to physically move and merge the databases and all their applications;
· And, of course, relate social media Big Data to data warehouse data without the major risks of downtime, poor performance, and inaccurate data that come with trying to move the Big Data into the data warehouse.
This last use, of course, is what data virtualization vendor products such as Composite Software's Composite Information Server 6 are now offering. The others make up the bulk of actual uses of data virtualization, because vendors such as IBM put their DV technology inside their MDM, BI, "information server", "global metadata repository", and "data integration" solutions.
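As a rough illustration of relating social media data to warehouse data without moving either, the Python sketch below combines a warehouse aggregate with a count of social media mentions at query time. Everything in it is an assumption made for illustration: the in-memory SQLite table stands in for a data warehouse, SOCIAL_FEED stands in for a social media source, and the function names are hypothetical.

```python
import json
import sqlite3
from collections import Counter

# Stand-in for the data warehouse: structured sales facts.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (product TEXT, units INTEGER)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)",
                      [("widget", 120), ("gadget", 45), ("widget", 30)])

# Stand-in for a social media feed: semi-structured JSON that is never
# loaded into the warehouse.
SOCIAL_FEED = json.dumps([
    {"text": "Love my new widget", "product": "widget"},
    {"text": "gadget broke again", "product": "gadget"},
    {"text": "widget is great", "product": "widget"},
])


def mentions_by_product(feed_json):
    """Count posts per product directly from the feed."""
    posts = json.loads(feed_json)
    return Counter(post["product"] for post in posts)


def sales_with_mentions():
    """Combine a warehouse query with feed counts when the report is run."""
    mentions = mentions_by_product(SOCIAL_FEED)
    rows = warehouse.execute(
        "SELECT product, SUM(units) FROM sales "
        "GROUP BY product ORDER BY product"
    )
    return [{"product": product, "units": units,
             "mentions": mentions.get(product, 0)}
            for product, units in rows]


report = sales_with_mentions()
for line in report:
    print(line)
```

The warehouse keeps doing what it is tuned for (the aggregate), while the feed stays where it is; only the combined result exists, and only for as long as the report needs it.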
Now we come to the confusion about definitions. Originally, data virtualization was called "Enterprise Information Integration," which is actually a better description of its full capabilities, although it fails to capture the ability to access the data in real time. Over time, most of the original DV companies were bought by large infrastructure-software vendors (for example, MetaMatrix by Red Hat, or Venetica by IBM), who continue to offer the products as standalones but whose major sales are as part of larger products. Then, when virtualization became the rage, other, smaller DV companies like Composite Software or Denodo Technologies were able to point out that, as noted above, data virtualization technology lets you mimic one gigantic "virtual" database, and EII was successfully re-christened as Data Virtualization. As you can see from the above, calling it data virtualization is completely justified by what DV technology does, but the technology can do much more.
And that is why saying "EII failed in the market" is a joke. Because if you have an MDM solution, DV technology is there. If you have real-time business intelligence that can reach data outside of the data warehouse, DV technology is there. And, of course, over the years, endless DV "projects" have accumulated inside large and medium-sized enterprises and in government. The DV vendors you see are the tip of the iceberg. Data virtualization is everywhere.
Wayne Kernochan of Infostructure Associates has been an IT industry analyst focused on infrastructure software for more than 20 years.