Big Data, MapReduce, Hadoop, NoSQL: The Relational Technology Behind the Curtain
Updated · Oct 20, 2011
One of the more interesting features of vendors’ recent marketing push to sell business intelligence and analytics is the emphasis on the notion of Big Data, often associated with NoSQL, Google MapReduce, and Apache Hadoop, but without a clear explanation of what these are, and where they are useful. It is as if we were back in the days of “checklist marketing,” where the aim of a vendor like IBM or Oracle was to convince you that if competitors’ products didn’t support a long list of features, those competitors would not provide you with the cradle-to-grave support you needed to survive computing’s fast-moving technology.
As it turned out, many of those features were unnecessary in the short run, and a waste of money in the long run; remember rules-based AI? Or so-called standard Unix? The technology in those features was later to be used quite effectively in other, more valuable pieces of software, but the value-add of the actual feature itself turned out to be illusory.
As it turns out, we are not back in those days, and Big Data via Hadoop and NoSQL does indeed have a part to play in scaling Web data. However, I find that IT buyer misunderstandings of these concepts may indeed lead to much wasted money, not to mention serious downtime. These misunderstandings stem from a common source: marketing’s failure to explain how Big Data relates to the relational databases that have fueled almost all data analysis and data-management scaling for the last 25 years.
It resembles the scene in Wizard of Oz where a small man, trying to sell himself as a powerful wizard by manipulating stage machines from behind a curtain, becomes so wrapped up in the production that when someone notes “There’s a man behind the curtain” the man shouts “Pay no attention to the man behind the curtain!” In this case, marketers are so busy shouting about the virtues of Big Data related to new data management tools and “NoSQL” that they fail to note the extent to which relational technology is complementary to, necessary to, or simply the basis of, the new features.
So here is my understanding of the present state of the art in Big Data, and the ways in which IT buyers should and should not seek to use it as an extension of their present (relational) business intelligence and information management capabilities. As it turns out, when we understand both the relational technology behind the curtain and the ways it has been extended, we can do a much better job of applying Big Data to long-term IT tasks.
NoSQL or NoREL?
The best way to understand the place of Hadoop in the computing universe is to view the history of data processing as a constant battle between parallelism and concurrency. Think of the database as a data store plus a protective layer of software that is constantly being bombarded by transactions – and often, another transaction on a piece of data arrives before the first is finished. To handle all the transactions, databases have two choices at each stage in computation: parallelism, in which two transactions are literally being processed at the same time; and concurrency, in which a processor switches between the two rapidly in the middle of the transaction.
Pure parallelism is obviously faster, but to avoid inconsistencies in the results of the transaction, you often need coordinating software, and that coordinating software is hard to operate in parallel, because it involves frequent communication between the parallel “threads” of the two transactions. At a global level (like that of the Internet), the choice now translates into a choice between “distributed” and “scale-up” single-system processing.
As it happens, back in graduate school I did a calculation of the relative performance merits of tree networks of microcomputers versus machines with a fixed number of parallel processors, which provided some general rules that are still applicable. There are two key factors that are relevant here: “data locality” and “number of connections used.” This means that you can get away with parallelism if, say, you can operate on a small chunk of the overall data stored on each node, and if you don’t have to coordinate too many nodes at one time.
Enter the problems of cost and scalability. The server farms that grew like Topsy during Web 1.0 had hundreds and thousands of scale-out servers that were set up to handle transactions in parallel. This had obvious cost advantages, since PCs were far cheaper. But data locality was a serious problem in trying to scale, since even when data was partitioned correctly in the beginning between clusters of PCs, over time data copies and data links proliferated, requiring more and more coordination. Meanwhile, in the high performance computing (HPC) area, grids of PC-type small machines operating in parallel found that scaling required all sorts of caching and coordination “tricks,” even when, by choosing the transaction type carefully, the user could minimize the need for coordination.
In certain instances, however, relational databases designed for “scale-up” systems and structured data did even less well. For indexing and serving massive amounts of “rich text” (text plus graphics, audio and video), for streaming media, and of course for HPC, a relational database would insist on careful consistency between data copies in a distributed configuration and so could not squeeze the last ounce of parallelism out of these transaction streams. And so, to minimize costs and to maximize the parallelism of these types of transactions, Google, the open source movement, and various others turned to MapReduce, Hadoop and various other non-relational approaches.
These efforts combined open-source software (typically related to Apache), large amounts of small or PC-type servers and a loosening of consistency constraints on the distributed transactions (an approach called eventual consistency). The basic idea was to minimize coordination by identifying types of transactions where it didn’t matter if some users got “old data” rather than the latest data, or if some users got an answer while others didn’t.
How well does this approach work? A recent communication from Pervasive Software (about an upcoming conference) noted a study of one implementation which found 60 instances of unexpected unavailability “interruptions” in 500 days. This is certainly not up to the standards of the typical business-critical operational database, but is also not an overriding concern to today’s Hadoop users.
The eventual consistency part of this overall effort has sometimes been called NoSQL. However, Wikipedia notes that in fact it might correctly be called NoREL, meaning “for situations where relational is not appropriate.” In other words, Hadoop and the like by no means exclude all relational technology, and many of them concede that relational “scale-up” databases are more appropriate in some cases even within the broad overall category of Big Data (i.e., rich-text Web data and HPC data). Indeed, some implementations provide extended-SQL or SQL-like interfaces to these non-relational databases.
Wayne Kernochan of Infostructure Associates has been an IT industry analyst focused on infrastructure software for more than 20 years.