Using NoSQL Databases to Handle Fast Data
Updated · May 10, 2016
NoSQL databases have grown markedly in popularity over the last few years, and for good reasons. Broadly speaking, NoSQL is proving especially useful in the following two areas of increasing interest to IT organizations:
- Scaling to handle massive numbers of transactions in cases where that is difficult without relaxing relational databases’ strict requirement of immediate consistency with existing data. Because this need is most acute in public-cloud applications such as Facebook, NoSQL databases in this area are usually associated with Hadoop, MapReduce and document or file storage rather than tables.
- Delivering “almost-real-time” performance for the large numbers of distributed transactions (particularly writes) associated with Fast Data, such as the Internet of Things (IoT). Here, NoSQL databases are particularly associated with Apache Spark, an open source cluster-computing engine well suited to scaling almost-real-time access to data. One of the key benefits of the best Fast Data-friendly NoSQL databases is that they allow users to tune the tradeoff between scalability and data quality dynamically.
So what are NoSQL databases exactly? And what are the key ways that users can employ them to deliver maximum value in the two cases outlined above?
NoSQL Database Is Not NoSQL
One of the first tasks in explaining NoSQL databases is to clarify the very misleading “NoSQL” name. NoSQL databases do not seek to crowd out relational databases and their existing uses, nor do they refuse to support the SQL query language developed for relational databases.
On the contrary, SQL is increasingly being implemented in existing NoSQL databases. A key part of today’s Fast Data movement is the effort to clarify how NoSQL databases can serve as front ends to existing relational analytics systems: the NoSQL database performs almost-real-time transactional and analytical tasks, while the relational database handles the deeper analytics on less fresh data that it has always performed.
A reasonable definition is that NoSQL databases use storage structures other than relational tables (such as files or documents) in order to scale where relational databases typically cannot.
Most such cases of relational inability to scale are cases where the requirements of ensuring that data is always consistent and not lost (typically referred to as ACID: atomicity, consistency, isolation and durability) need to be relaxed.
Therefore, most NoSQL databases have another characteristic unlike most if not all relational systems: They allow relaxation of the ACID requirements. In fact, at least until recently, many NoSQL databases did not support ACID at all. Hence, initial data quality for most NoSQL databases is significantly worse than that of relational database systems.
Over the course of the last decade or so Facebook, Google and Amazon, among others, have learned how to handle this problem adequately in clouds with complementary relational databases. Nevertheless, it is important for IT to recognize and plan for this potential problem up front.
Finally, the popularity of the misleading NoSQL tag has allowed companies to shoehorn pre-Facebook non-relational databases into that marketing category. In particular, document-handling databases from companies such as MarkLogic are now frequently classified as NoSQL databases; after all, they don’t store their data as tables, but as documents.
I have no quarrel with MarkLogic being classified as a NoSQL database according to the strict definition of the term, and the latest marketing numbers cited by Wikipedia say it is the vendor with the most revenue in the NoSQL market. However, the focus of this article is on the new applications that make NoSQL of greater use to IT, and there MarkLogic’s focus on documents is less likely to be useful in meeting the full range of IT needs.
Emerging Best Practices for NoSQL Databases
The conclusions drawn above lead us to a few simple rules about effective NoSQL implementation:
- In many if not most cases, the NoSQL database should be used as a complement to an existing or additional relational database that at the least handles deeper post-arrival data analytics.
- All else being equal, a NoSQL database that offers a broader range of “ACID relaxation” — if possible, all the way from no consistency to near-ACID-compliance — is better than one that only allows no consistency. IT should plan how it will use and tune that “control knob” up front.
- If the application is aimed at Fast Data, IT should emphasize support for in-memory computing. That typically means, among other things, implementing and using Apache Spark.
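The “control knob” in the second rule can be made concrete with a small simulation. The sketch below assumes a Dynamo-style replicated store with N replicas, where a write is acknowledged after reaching W of them and a read consults R of them. All class and function names are hypothetical, and this is a toy model rather than any vendor's implementation.

```python
import random


class ReplicatedRegister:
    """Toy model of one key stored on n replicas, each holding (version, value)."""

    def __init__(self, n, seed=0):
        self.replicas = [(0, None)] * n
        self.rng = random.Random(seed)
        self.version = 0

    def write(self, value, w):
        """Update only w randomly chosen replicas before acknowledging the write.

        Simplification: the remaining replicas never catch up in this model.
        """
        self.version += 1
        for i in self.rng.sample(range(len(self.replicas)), w):
            self.replicas[i] = (self.version, value)

    def read(self, r):
        """Ask r randomly chosen replicas and return the newest value seen."""
        chosen = self.rng.sample(range(len(self.replicas)), r)
        newest = max((self.replicas[i] for i in chosen), key=lambda pair: pair[0])
        return newest[1]


def stale_reads(n, w, r, trials=1000):
    """Count reads that miss the latest write for a given (n, w, r) setting."""
    reg = ReplicatedRegister(n)
    stale = 0
    for t in range(trials):
        reg.write(t, w)
        if reg.read(r) != t:
            stale += 1
    return stale
```

Under these assumptions, `stale_reads(3, 2, 2)` returns 0, because any two replicas that are read must include at least one of the two that were written (R + W > N). A setting such as `stale_reads(3, 1, 1)` trades that guarantee for less waiting per operation and will report stale reads. Planning where each workload sits between those extremes is the up-front tuning decision the rule refers to.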
I will also add a few suggestions that are less well established as part of a highly effective implementation – understandably, since the Fast Data market only began to take off less than two years ago, and thus “best practices” in that area are not fully developed:
- “Data governance,” as it is now typically called, should often be implemented at the beginning. One reason is that another hot topic with a misleading title, “data lakes,” involves creating a pool of data that is not subject to the typical ETL (extract-transform-load) data cleansing of today’s analytics database architectures. Data lakes need data governance, and it is likely that some NoSQL data will move immediately into a data lake. Better to ensure NoSQL data governance compliance now, rather than create a situation in which data lake governance becomes ineffective.
- Even for quick-hit analytics, data needs metadata, and that metadata should bridge NoSQL and relational databases. In my opinion, the best way to do this is via data virtualization software such as that available from Cisco or Denodo. Or, users can piggyback on existing global metadata repositories. Again, IT should face this problem up front.
- In real estate, they say, the most important thing is location, location, location. In the new use cases of NoSQL databases, the most important thing is performance scalability, performance scalability, performance scalability. That means not only planning for almost-real-time performance scaling over the next one or two years, but also taking into account vendor plans and stability three to four years out.
Three NoSQL Databases to Check Out
The following NoSQL databases are not necessarily the best. However, they are close to the best in their area as of now, and they tend to cover a broad range of the new use cases that we are discussing. At the least, they provide an excellent way to kick the tires of NoSQL databases in general.
MongoDB
This one is a no-brainer. Not only is the open source MongoDB a market leader, it has “word of mouth” marketing clout, and it is frequently used in the public clouds driving adoption of the new NoSQL database areas.
MongoDB’s underlying data storage structure is Internet-document, using a JSON-like format and dynamic schemas. (In other words, it may be more flexible than relational systems in adapting to new data types, and it is potentially better at handling non-structured data.) MongoDB is an exceptionally popular open source database, and therefore it is potentially easier to integrate with public cloud database architectures.
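To illustrate what a JSON-like format with dynamic schemas means in practice, the following sketch uses plain Python dictionaries as stand-ins for documents. It does not use MongoDB or its drivers; the `find` helper and the sample sensor fields are hypothetical and only loosely mimic a query-by-example lookup.

```python
import json

# Documents in the same "collection" need not share the same fields.
readings = [
    {"sensor": "t-100", "temp_c": 21.5},
    {"sensor": "t-101", "temp_c": 19.0, "battery": {"pct": 82}},
    {"sensor": "cam-7", "motion": True, "tags": ["lobby", "night"]},
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == wanted for field, wanted in criteria.items())
    ]


# A new data type arrives: no schema migration, just a differently shaped document.
readings.append({"sensor": "gps-3", "lat": 42.36, "lon": -71.06})

matches = find(readings, sensor="t-101")
print(json.dumps(matches, indent=2))
```

In a relational table, the GPS record would typically require new columns or a new table before it could be stored. Here it simply sits alongside the other documents, at the cost of pushing shape checks onto the application.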
Some questions have been raised about MongoDB’s performance scalability compared to, say, Redis. However, its wide use and zero cost make it a bit like Linux in Linux’s early years: likely to spawn a de facto standard and supporting software that will improve performance scalability and add necessary components for easier integration with relational systems.
DataStax
It may seem odd of me to pick DataStax rather than Apache Cassandra, since Cassandra is more “popular” according to Wikipedia. However, DataStax is in fact Cassandra-plus: an enterprise distribution of the open source Cassandra NoSQL database, plus extensions for analytics using Apache Spark and search using Apache Solr.
In other words, DataStax provides Apache Spark for Fast Data implementations “baked in.” And then there’s the fact that DataStax is one of the NoSQL database market revenue leaders.
What about the Cassandra end of DataStax? Apache Cassandra is of course open source, and was designed for distributed database architectures. It is unusual in that it uses columnar database technology in some cases and in that it promises (and apparently delivers) linear scalability without downtime when a new machine is added. It also allows the user to tune between full database consistency and “let her rip” no-consistency performance.
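One technique behind this kind of scaling is consistent hashing, which Cassandra uses to assign data to nodes. The sketch below is a simplified ring, not Cassandra's actual partitioner; the `Ring` class, node names and key names are hypothetical. It illustrates why adding a machine relocates only a share of the keys rather than reshuffling everything.

```python
import bisect
import hashlib


def token(text):
    """Map a string to a position on the ring using a stable hash."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Simplified consistent-hash ring with several virtual tokens per node."""

    def __init__(self, vnodes=64):
        self.vnodes = vnodes
        self.tokens = []   # sorted ring positions
        self.owner = {}    # ring position -> node name

    def add_node(self, node):
        """Place the node's virtual tokens on the ring."""
        for i in range(self.vnodes):
            t = token(f"{node}#{i}")
            self.owner[t] = node
            bisect.insort(self.tokens, t)

    def node_for(self, key):
        """The first token at or after the key's position owns it (wrapping around)."""
        idx = bisect.bisect_left(self.tokens, token(key)) % len(self.tokens)
        return self.owner[self.tokens[idx]]


ring = Ring()
for name in ("node-a", "node-b", "node-c"):
    ring.add_node(name)

keys = [f"device-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

# Add a fourth machine and see which keys change owner.
ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed owner")
```

Every key that changes owner in this model moves to the new node, and the rest stay where they were, so the existing machines can keep serving their data while the newcomer takes over its share.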
One potential area of concern for Fast Data implementers is that, in some cases at least, performance scalability appears to come at the expense of read and write latency, that is, the ability to get at the data immediately.
Redis
In some ways, Redis Labs’ Redis NoSQL database may be the most immediately useful of the three databases I have cited. Redis is open source, is presently being further developed by Redis Labs and is designed as an in-memory database management system. In fact, in some implementations Redis’s turf is main memory and NVRAM, while relational systems handle most if not all transactions and analytics involving disk.
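The division of labor described here, with an in-memory store absorbing fresh writes and a relational system handling disk-based analytics, can be sketched roughly as follows. This is an assumption-laden illustration: a Python dictionary stands in for the in-memory store, SQLite from the standard library stands in for the relational database, and the `HotTier` class and table layout are hypothetical. It does not use Redis or any Redis API.

```python
import sqlite3


class HotTier:
    """In-memory buffer for fresh readings, flushed in batches to a relational store."""

    def __init__(self, db, batch_size=3):
        self.db = db
        self.batch_size = batch_size
        self.latest = {}    # device id -> most recent value (served from memory)
        self.pending = []   # rows not yet written to the relational store
        db.execute("CREATE TABLE IF NOT EXISTS readings (device TEXT, value REAL)")

    def write(self, device, value):
        """Accept a write at memory speed; persist only when a batch fills up."""
        self.latest[device] = value
        self.pending.append((device, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Move buffered rows into the relational store for deeper analytics."""
        self.db.executemany("INSERT INTO readings VALUES (?, ?)", self.pending)
        self.db.commit()
        self.pending.clear()

    def read_latest(self, device):
        """Almost-real-time lookup that never touches the relational store."""
        return self.latest.get(device)


db = sqlite3.connect(":memory:")
tier = HotTier(db)
for device, value in [("d1", 20.0), ("d2", 21.5), ("d1", 22.0), ("d2", 23.5)]:
    tier.write(device, value)

# The fresh value comes from memory; the average covers only rows flushed so far.
print(tier.read_latest("d1"))
print(db.execute("SELECT AVG(value) FROM readings WHERE device = 'd1'").fetchone()[0])
```

Rows still sitting in `pending` would be lost if the process failed before a flush. That is a small-scale version of the tradeoff between speed and data quality that runs through this article, and a larger `batch_size` pushes further toward speed.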
The key Redis advantage is exceptional performance for in-memory database applications, due especially to its focus on in-memory transaction handling. Moreover, because of its integration with relational systems and the increasing relative price-performance advantage of NVRAM compared to disk, it is not limited to small data sets: one use case as of 2015 achieved performance scalability involving terabytes of data.
Therefore, while Redis might seem at first blush to be especially useful in almost-real-time Fast Data implementations, it is also increasingly appropriate in massive-numbers-of-transactions Big Data implementations. Although apparently not to the degree of Cassandra, Redis does offer a fair amount of “consistency tuning.”
Wayne Kernochan is the president of Infostructure Associates, an affiliate of Valley View Ventures that aims to identify ways for businesses to leverage information for innovation and competitive advantage. An IT industry analyst for 22 years, he has focused on analytics, databases, development tools and middleware, and ways to measure their effectiveness, such as TCO, ROI and agility measures. He has worked for respected firms such as Yankee Group, Aberdeen Group and Illuminata, and has helped craft marketing strategies based on competitive intelligence for vendors ranging from Progress Software to IBM.