Making a Business Intelligence Case for Columnar Databases
Wayne Kernochan of Infostructure Associates says columnar databases are making a comeback – and for good reason.
A couple of years ago, columnar databases were a hot topic in business intelligence circles; today, not so much. Yet a quick scan of the Transaction Processing Council’s top 10 list for its TPC-H business intelligence benchmark (as of 1/23/12) reveals an amazing disparity between columnar databases and the traditional relational row-oriented databases.
In most categories, Exasol or Vectorwise was the leader in performance by an order of magnitude over Oracle Database or IBM DB2. Three-year-old Sybase IQ implementations were still in the top 10, and even ParAccel got into the act with an older result that beat all row-oriented relational competitors.
As for price/performance, in the 1-TB (terabyte) TPC-H area an Exasol result achieved $0.18/QphH on parallel Dell machines, 10 times less per unit of performance as its nearest relational competitor, the always inexpensive Microsoft SQL Server. And that’s just one example.
Columnar isn’t just beating the Oracles, Microsofts, and IBMs of the world in performance and price/performance. It’s whomping them, all the way up to the 10-TB level.
Well, you say, sooner or later Oracle or IBM will catch up. After all, they have done it before with other database technologies, such as XML and object databases and object-relational technology. Eventually, say the skeptics, history shows row-oriented relational will match columnar. Actually, history shows the opposite.
Columnar databases are designed from the ground up to store data by columns rather than rows and have been so since the 1970s, when a columnar (then known as inverted-list) database called Model 204 often demonstrated performance advantages over its mainframe-database competitors. Yes, row-oriented relational can adopt column-based data compression in certain instances. If you look under the covers of bit-mapped indexing, that major jump in row-oriented relational performance is really columns with two values stored as bits in main memory.
However, to capture all the compression techniques that columnar databases can use, row-oriented relational would have to undergo a fundamental redesign into a primarily columnar architecture – a risky process that would probably take at least three years. Therefore, columnar databases’ edge in compressing data for faster in-memory processing will likely remain, all else being equal – and the TPC-H results show that any other advantages row-oriented may possess do not negate this fundamental columnar advantage.
It is also true that where only a few columns per query can be compressed, row-oriented relational has a significant (but not order-of-magnitude) performance advantage. However, the effort that must go into finding the right way to split a stream of transactions between row-oriented and column-oriented storage, as Oracle Exadata does, is significant. The TPC-H results suggest that it may or may not be worth it.
So why haven’t we heard more about columnar databases lately, especially since their performance "sweet spot" is in the analytics that is now a red-hot concern? Actually, maybe we have – we just haven’t heard the label "columnar" attached to the products.
Unless you've been living in a cave, chances are you've heard about SAP’s HANA. Underlying all that infrastructure, in the long term, is an in-memory database specifically designed to use columnar storage. A logical inference is that this will be complemented if necessary by Sybase IQ – which would fit better than ABAP (SAP’s Advanced Business Application Programming language) or a row-oriented relational disk-based database.
Then there’s HP. Except for a network management solution and a later development-testing solution (when HP acquired Mercury Interactive), HP has been notably reluctant to be seen as a software vendor. Yet not only did HP acquire Vertica last year, it also acquired Autonomy in order to, as it said, infuse Vertica with the ability to handle semi-structured and unstructured data analytics – just like the row-oriented relational giants.
Finally, there’s EMC’s Greenplum. According to the EMC website, Greenplum effectively puts the same veneer over your choice of databases – row-oriented or columnar. If the customer wants pure columnar, the customer gets pure columnar – or so they say.
Grid vs. Appliance
It is also notable that row-oriented and "hybrid" relational is associated more with high-performance "appliance" boxes – like Oracle’s Exadata and IBM’s Netezza – while columnar is associated with massive shared-nothing "grids" like the ones Exasol uses to achieve outstanding performance and price/performance at the same time.
It’s too soon to say whether this will be a permanent split, but that would be a shame – for appliances. Appliances like Exadata already suffer from concerns that their specialized hardware technology may not keep pace with Intel-chip performance increases; ruling out pure columnar would seem to add yet another "future risk" to an appliance.
Bottom Line for IT
The above is a brief overview of a dynamic technology – labeled by some commentators, including me, as "disruptive." Disruptive is upsetting to some vendors, not to customers; on the contrary, disruptive technologies are usually good news for IT buyers.
In particular, columnar databases are now at the point where they can be used for very large-scale analytics, for newly-created data warehouses and data marts and for rapid-implementation cloud, Big Data or "mixed" data processing that includes a medium amount of updates along with reads and queries. Therefore, instead of treating columnar database offerings as "new kids on the block," IT buyers should start to consider them as on their face more attractive than row-oriented relational databases.
In other words, you should begin asking old-style, row-oriented relational databases to prove their worth in your new private-cloud analytics implementations compared to columnar, instead of assuming that row-oriented is the safe choice.
There’s a great deal of potential for improved performance and price/performance in columnar database technology, and IT buyers now have several enterprise-ready products to choose from. If you haven’t put columnar on your business intelligence short lists, now is a good time to do so.
Wayne Kernochan is the president of Infostructure Associates, an affiliate of Valley View Ventures that aims to identify ways for businesses to leverage information for innovation and competitive advantage. Wayne has been an IT industry analyst for 22 years. During that time, he has focused on analytics, databases, development tools and middleware, and ways to measure their effectiveness, such as TCO, ROI, and agility measures. He has worked for respected firms such as Yankee Group, Aberdeen Group and Illuminata, and has helped craft marketing strategies based on competitive intelligence for vendors ranging from Progress Software to IBM.