Big Data, MapReduce, Hadoop, NoSQL: The Relational Technology Behind the Curtain: Page 2
Where Are the Boundaries of Big Data?
The most popular "spearhead" of Big Data right now appears to be Hadoop. As noted, it pairs MapReduce with a distributed file system for data-intensive applications. Its main pieces are Hadoop Common (the shared utilities that the other modules rely on), the Hadoop Distributed File System [HDFS] (which spreads replicated blocks of each file across a cluster of machines), and the MapReduce engine (which divides nodes into a master coordinator and worker task executors). It therefore allows parallel scaling of processing against rich-text data such as some social media data. Hadoop operates by dividing a "task" into "sub-tasks" that it hands out to back-end servers, rerunning a sub-task elsewhere if a server fails; the servers all operate in parallel (conceptually, at least) on a common data store.
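The divide-and-recombine pattern described above can be sketched in a few lines of Python. This is only an illustration of the MapReduce idea on a single machine, not Hadoop's API: the function names (`split_into_subtasks`, `map_subtask`, `reduce_results`, `word_count`) are hypothetical, and a thread pool stands in for the back-end servers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_subtasks(lines, n_subtasks):
    """Divide the input into at most n_subtasks chunks (some may be empty)."""
    n_subtasks = max(1, n_subtasks)
    return [lines[i::n_subtasks] for i in range(n_subtasks)]


def map_subtask(chunk):
    """Map step: count words in one chunk, independently of all other chunks."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_results(partials):
    """Reduce step: recombine the partial counts into a single result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


def word_count(lines, n_subtasks=4):
    """Run the sub-tasks in parallel (threads stand in for servers), then reduce."""
    chunks = split_into_subtasks(lines, n_subtasks)
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partials = list(pool.map(map_subtask, chunks))
    return reduce_results(partials)


if __name__ == "__main__":
    posts = [
        "big data is big",
        "hadoop handles big data",
        "relational data still matters",
    ]
    print(word_count(posts, n_subtasks=2).most_common(2))
```

What the sketch leaves out is exactly what makes the real thing hard: moving the data to the servers, tracking which sub-task ran where, and recovering when one of them fails.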
As it turns out, there are limits even to Hadoop's eventual-consistency type of parallelism. In particular, it now appears that the metadata that supports recombination of the results of "sub-tasks" must itself be "federated" across multiple nodes, for both availability and scalability. In fact, Pervasive Software notes that its own investigations show that using multiple-core "scale-up" nodes for the sub-tasks improves performance compared to proliferating yet more distributed single-processor scale-out servers. In other words, the most scalable system, even in Big Data territory, is one that combines strict and eventual consistency, parallelism and concurrency, distributed and scale-up single-system architectures, and NoSQL and relational technologies.
Solutions like Hadoop are effectively out there "in the cloud" and therefore outside the usual walls of enterprise data centers. Thus, there are fixed and probably permanent physical and organizational boundaries between IT's data stores and those serviced by Hadoop. Moreover, it should be apparent from the above that existing business intelligence and analytics systems will not suddenly convert to Hadoop files and access mechanisms, nor will "mini-Hadoops" suddenly spring up inside the corporate firewall and create havoc with enterprise data governance. The use cases are simply too different.
The remaining boundaries – the ones that should matter to IT buyers – are those between existing relational business intelligence and analytics data stores on the one hand and Hadoop's file system and files on the other. And here is where "eventual consistency" really matters. The enterprise cannot treat Hadoop-held data as just another business intelligence data source. It differs fundamentally in that the enterprise can be far less sure that the data is current – or even available at all times. So scheduled reporting or business-critical computing based on this data is much more difficult to pull off.
On the other hand, this is data that would otherwise be unavailable for BI or analytics processes – and, because of the low-cost approach to building the solution, it should be exceptionally low-cost to access. However, pointing the raw data at existing business intelligence tools would be like pointing a fire hose at your mouth, with similarly painful results. Instead, the savvy IT organization will have plans in place to filter the data before its BI tools begin to access it.
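As a rough illustration of what "filtering before access" might look like, the Python sketch below reduces a raw feed to a small summary and loads only that summary into a relational table. Everything specific here is an assumption made for the example: the one-JSON-object-per-line input, the `day` and `text` fields, the `mentions` table, and the function names. SQLite stands in for the enterprise BI database.

```python
import json
import sqlite3
from collections import Counter


def filter_and_summarize(raw_lines, keyword):
    """Keep only records that mention `keyword`; return mention counts per day."""
    per_day = Counter()
    needle = keyword.lower()
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed input rather than fail the whole run
        if not isinstance(record, dict):
            continue  # valid JSON, but not the record shape assumed here
        day = record.get("day")
        text = str(record.get("text", ""))
        if day and needle in text.lower():
            per_day[day] += 1
    return per_day


def load_summary(conn, per_day):
    """Load the (small) summary into a relational table that BI tools can query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mentions (day TEXT PRIMARY KEY, n INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO mentions (day, n) VALUES (?, ?)",
        sorted(per_day.items()),
    )
    conn.commit()


if __name__ == "__main__":
    raw_feed = [
        '{"day": "2012-01-01", "text": "Acme support was great"}',
        "this line is not JSON",
        '{"day": "2012-01-01", "text": "acme again"}',
        '{"day": "2012-01-02", "text": "nothing relevant"}',
    ]
    conn = sqlite3.connect(":memory:")
    load_summary(conn, filter_and_summarize(raw_feed, "acme"))
    print(conn.execute("SELECT day, n FROM mentions ORDER BY day").fetchall())
```

The point of the design is that the fire hose never reaches the BI tools: they see one row per day, queried with ordinary SQL, while the raw feed stays outside.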
The Long-Run Bottom Line
The impression given by some marketers is that Hadoop and its ilk are required for Big Data, where Big Data is more broadly defined as most Web-based semi-structured and unstructured data. If that is your impression, I believe it to be untrue. Instead, handling Big Data is likely to require a careful mix of relational and non-relational, data-center and extra-enterprise business intelligence, with relational in-enterprise BI taking the lead role. And as the limits to parallel scalability of Hadoop and the like become more and more evident, the use of SQL-like interfaces and relational databases within Big Data use cases will become more frequent, not less.
Therefore, I believe that Hadoop and its brand of Big Data will always remain a useful but not business-critical adjunct to an overall business intelligence and information management strategy. Instead, users should anticipate that it will take its place alongside relational access to other types of Big Data, and that the key to IT success in Big Data BI will be in intermixing the two in the proper proportions, and with the proper security mechanisms. Hadoop, MapReduce, NoSQL, and Big Data, they're all useful – but only if you pay attention to the relational technology behind the curtain.