Coupling Semantics, ETL and Data Marts – Better than Virtualization?
Updated · Jan 19, 2011
Last year, and probably again this year, data federation or data virtualization (the technology formerly known as EII) received a good deal of attention, in part because of the whole “virtualize everything” trend.
I wrote about it a few times, focusing on how data federation supports integration.
Add federation to the Semantic Web, and the picture looks even rosier, but that’s more fairy tale than reality, contends Rob Gonzalez in a recent Semanticweb.com column titled “I’ve got a Federated Bridge to Sell You (A Defense of the Warehouse).” I suspect business leaders will find it hard to follow in depth. Still, anyone can skim it to get the gist of his argument and how it might affect their organization’s approach to data.
For IT leaders, I think it’s a must-read. It’s also a timely topic, given recent discussions about how organizations can make better use of existing ETL tools.
He attacks the federated approach on two counts: performance and query functionality. He also tackles the idea that federation gives you fresher data. In short, he’s not just arguing that semantic technologies work well with ETL and with consolidating data into data marts; he contends that this combination creates a much better experience than federation:
I believe that coating old, weather-beaten databases with a coat of semantic paint is awesomely valuable. It makes creating ETL pipelines that bring together data from all kinds of locations a breeze as compared to traditional, relationally-oriented ETL pipelines. It’s hardly even fair to compare the two approaches, except insofar as the maturity of the traditional technologies is concerned … In fact, I see semantics as enabling on-demand datamarts in ways that traditional data integration technologies simply have failed to do.
He contrasts this with the federated approach, which sends each request to the underlying traditional databases at query time:
In effect you’re asking completely unanticipated questions of traditional databases, which were not designed to handle unanticipated questions with any sort of performance guarantee. … Most DBAs are happy to provide you with data dumps on a schedule, and maybe with a way to query for updates more regularly, but will absolutely not let you put their already strained transactional systems under additional load from your ad hoc queries.
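To make the load argument concrete, here is a minimal sketch, not taken from Gonzalez’s column, that contrasts the two access patterns using Python’s built-in `sqlite3` module. The two in-memory “source systems,” the table names, and the `source_hits` counter are all hypothetical stand-ins, and the sketch leaves out the semantic layer entirely. It only illustrates that a federated query touches the sources every time it runs, while a data mart touches them once per scheduled load.

```python
import sqlite3

def make_source(ddl, insert_sql, rows):
    """Create a hypothetical in-memory source system and seed it with rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    conn.commit()
    return conn

# Two stand-in operational systems (names and schemas are assumptions).
crm = make_source(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "EMEA"), (2, "APAC")],
)
orders = make_source(
    "CREATE TABLE orders (customer_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 75.0)],
)

# Counts how often each source system is read.
source_hits = {"crm": 0, "orders": 0}

def read_source(name, conn, sql):
    """Run a query against a source system and record the hit."""
    source_hits[name] += 1
    return conn.execute(sql).fetchall()

def federated_revenue_by_region():
    """Federated style: read both sources at query time, join in the middle tier."""
    regions = dict(read_source("crm", crm, "SELECT id, region FROM customers"))
    totals = {}
    for customer_id, amount in read_source(
        "orders", orders, "SELECT customer_id, amount FROM orders"
    ):
        region = regions[customer_id]
        totals[region] = totals.get(region, 0.0) + amount
    return totals

def load_mart():
    """ETL style: copy both sources into a local data mart in one scheduled pass."""
    mart = sqlite3.connect(":memory:")
    mart.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")
    mart.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    mart.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        read_source("crm", crm, "SELECT id, region FROM customers"),
    )
    mart.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        read_source("orders", orders, "SELECT customer_id, amount FROM orders"),
    )
    mart.commit()
    return mart

def mart_revenue_by_region(mart):
    """Ad hoc queries run against the mart and never touch the sources."""
    sql = (
        "SELECT c.region, SUM(o.amount) FROM orders o "
        "JOIN customers c ON c.id = o.customer_id GROUP BY c.region"
    )
    return dict(mart.execute(sql).fetchall())
```

In this toy setup, calling `federated_revenue_by_region()` repeatedly increments both counters in `source_hits` on every call, whereas any number of `mart_revenue_by_region(mart)` calls leave them untouched after a single `load_mart()`. The trade-off is that the mart only reflects the data as of its last load, which is the freshness question the column takes up.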
Gonzalez does acknowledge that there are certain situations involving network security where data marts simply won’t work and you need to use federation for queries. Be sure to check out the reader comments, too. There are several worthwhile discussions, including one post on whether federation provides a better solution when you’re dealing with large data sets.