Why Open Source Graph Databases Are Catching on: Page 2
In this article, we cover:
- What graph databases can do
- How graph databases work, and why they are better suited for certain data tasks than relational or NoSQL databases
- Some companies that are using graph databases, and how they are using them
- The most common use cases for graph databases
- Graph database strengths and weaknesses
- The size of the graph database market and its most prominent vendors
- How companies like Oracle and Microsoft are responding to the graph database trend
- How the open source Apache TinkerPop project is influencing graph database adoption
Graph databases aren't the answer to every business problem. At Geisinger Health System, which has become known for its deep dive into health care analytics, Chief Data Officer Nicholas Marko, MD, said standard BI tools are adequate for 80 percent of the organization’s needs.
The most logical graph database use cases arise when understanding relationships and their strength is paramount, and the need for performance, flexibility and reduced latency outstrips the capabilities of batch processing of aggregates.
For instance, detecting credit card fraud requires comparing purchases on the card with the card holder's normal buying patterns. In this case, the ability to flag suspicious activity in real time becomes vital.
"I think that you generally want to look to graph databases when the data complexity is high and when there is high value in the relationships within the data," said DataStax's Mallette.
"A graph will really shine under these conditions, because the data modeling is intuitive and relationships are considered first-class citizens."
He added, "Since relationships are first-class citizens, the entities (domain objects) in the graph become straightforward to connect and traverse to arbitrary depth, thus allowing for complex reasoning over the data that would be otherwise quite difficult with a [relational] or other NoSQL database."
Graph database strengths include:
- Performance. While query performance degrades quickly in relational databases with the number of joins, graph database performance remains fairly constant as the dataset grows.
- Flexibility. New kinds of relationships, nodes, labels and subgraphs can be added without interfering with existing queries and applications.
- Agility. The schema-free nature of the graph data model means the data model can evolve in concert with iterative software delivery practices.
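To make the traversal and flexibility points concrete, here is a minimal sketch in Python. It is not how any particular graph database is implemented; `TinyGraph`, its methods and the sample data are hypothetical names invented for illustration. It assumes each vertex holds direct references to its outgoing edges, so a multi-hop query is a walk over neighbors rather than a chain of joins, and a new relationship type can be added without touching an existing query.

```python
from collections import defaultdict, deque


class TinyGraph:
    """Toy in-memory graph (illustrative only): each vertex lists its own edges."""

    def __init__(self):
        # vertex -> list of (edge_label, neighbor)
        self.out_edges = defaultdict(list)

    def add_edge(self, source, label, target):
        self.out_edges[source].append((label, target))
        self.out_edges[target]  # register the target as a vertex too

    def reachable(self, start, max_depth, labels=None):
        """Breadth-first walk from start, following at most max_depth hops.

        If labels is given, only edges with those labels are followed.
        Returns (vertex, depth) pairs in the order they were discovered.
        """
        seen = {start}
        queue = deque([(start, 0)])
        found = []
        while queue:
            vertex, depth = queue.popleft()
            if depth == max_depth:
                continue
            for label, neighbor in self.out_edges[vertex]:
                if labels is not None and label not in labels:
                    continue
                if neighbor not in seen:
                    seen.add(neighbor)
                    found.append((neighbor, depth + 1))
                    queue.append((neighbor, depth + 1))
        return found


graph = TinyGraph()
graph.add_edge("alice", "knows", "bob")
graph.add_edge("bob", "knows", "carol")
graph.add_edge("carol", "bought", "laptop")

# Friends and friends-of-friends: two hops along "knows" edges.
friends_before = graph.reachable("alice", 2, labels={"knows"})

# A new kind of relationship is added later; the query above is unaffected.
graph.add_edge("alice", "reviewed", "laptop")
friends_after = graph.reachable("alice", 2, labels={"knows"})

# Following every edge type, up to three hops.
everything = graph.reachable("alice", 3)
```

Changing the depth is a matter of changing one argument, which is the kind of "arbitrary depth" traversal Mallette describes. A production graph database adds indexing, persistence and transactions that this sketch ignores.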
The issues users face with graph databases include partitioning and density. In a distributed environment, massive graphs are farmed out across a multi-machine compute cluster. Market leaders are addressing the need to limit cross-machine communication by putting information often retrieved together on the same machine.
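The partitioning problem can be illustrated with a small Python sketch. The graph, the two placements and the helper `cross_machine_edges` are all hypothetical; the point is only to show why placing vertices that are often retrieved together on the same machine reduces cross-machine traffic. Real systems use far more sophisticated placement strategies.

```python
def cross_machine_edges(edges, placement):
    """Count edges whose two endpoints live on different machines."""
    return sum(1 for a, b in edges if placement[a] != placement[b])


# Two tightly connected groups joined by a single edge (carol, dave).
edges = [
    ("alice", "bob"), ("bob", "carol"), ("alice", "carol"),
    ("dave", "erin"), ("erin", "frank"),
    ("carol", "dave"),
]
people = ["alice", "bob", "carol", "dave", "erin", "frank"]

# Naive placement: alternate vertices between machine 0 and machine 1.
round_robin = {name: i % 2 for i, name in enumerate(people)}

# Locality-aware placement: keep each group on one machine.
clustered = {"alice": 0, "bob": 0, "carol": 0, "dave": 1, "erin": 1, "frank": 1}

naive_cost = cross_machine_edges(edges, round_robin)
clustered_cost = cross_machine_edges(edges, clustered)
```

In this toy example the alternating placement cuts five of the six edges, while the clustered placement cuts only one, so a traversal within either group never leaves its machine.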
Density is the other issue. If a customer has bought a lot of products at a shopping site, that creates a dense graph, much of which might be irrelevant to the query at hand. The database needs to filter narrowly for each query.
Graph databases are not good at quickly doing global aggregations, Mallette said, though the open source TinkerPop helps mitigate the problems.
As an example, he said, "to simply count all the vertices in a graph requires iterating over every vertex in a graph. In a graph of billions of vertices, that can take a long time and if you were doing more than a count -- finding all the 'product' vertices then traversing to 'sale' vertices to calculate 'sales by month' -- it would likely become even more expensive."
Using a graph database for applications with those kinds of requirements "will present a weak spot that you will have to be aware of," Mallette said. "The problem is not insurmountable, but if most of your application requires this type of analysis in real-time, you might need to reconsider your graph model or, in some cases, consider other data storage approaches."
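Mallette's example can be sketched in Python to show where the cost comes from. The data layout below (a `vertices` dictionary and a `sold_in` edge map) and the function names are assumptions made for illustration, not any vendor's storage format. In Gremlin, the count would typically be written `g.V().count()`, but the work behind it is the same kind of full scan.

```python
from collections import Counter

# Hypothetical toy graph: vertex id -> properties.
vertices = {
    "p1": {"label": "product", "name": "laptop"},
    "p2": {"label": "product", "name": "phone"},
    "s1": {"label": "sale", "month": "2016-01", "amount": 900},
    "s2": {"label": "sale", "month": "2016-01", "amount": 500},
    "s3": {"label": "sale", "month": "2016-02", "amount": 950},
    "c1": {"label": "customer", "name": "alice"},
}

# Hypothetical edges: product vertex id -> ids of its 'sale' vertices.
sold_in = {"p1": ["s1", "s3"], "p2": ["s2"]}


def count_vertices(vertices):
    """Global count: there is no shortcut here, every vertex is visited once."""
    return sum(1 for _ in vertices)


def sales_by_month(vertices, sold_in):
    """Scan all vertices for products, then traverse to each product's sales."""
    totals = Counter()
    for vertex_id, props in vertices.items():
        if props["label"] != "product":
            continue
        for sale_id in sold_in.get(vertex_id, []):
            sale = vertices[sale_id]
            totals[sale["month"]] += sale["amount"]
    return dict(totals)


total_vertices = count_vertices(vertices)
monthly_totals = sales_by_month(vertices, sold_in)
```

Both functions do work proportional to the size of the whole graph, plus one hop per sale in the second case. With six vertices that is trivial; with billions, it is the weak spot the quote describes.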
Though IDC expects the overall database market to reach $50 billion by 2017, graph databases make up only a sliver of that. Forrester Research projects that 25 percent of enterprises will use graph databases by 2017.
Neo4j, the most popular graph database, comes in at No. 21 on the overall DB-Engines ranking. OrientDB, an open source document/graph hybrid, is the second most popular graph database, followed by Titan, an open source project used in DataStax and other offerings.
Neo4j recently released version 3.0, with an architecture overhaul primarily focused on a new data store. Graph-native storage is another issue in this market.
All the major database players, even those best known for their proprietary software, have released or are working on graph capabilities.
Microsoft CEO Satya Nadella recently cited LinkedIn's graph technology as one of its most attractive features prompting the $26.2 billion acquisition. Microsoft has also been working on Graph Engine, a distributed, in-memory, large graph processing engine.
Oracle in March released Parallel Graph Analytics (PGX) v1.2, employing parallelism to increase performance, a new query language for graph pattern matching, and a new algorithm and APIs to help you build a recommendation engine on top of your graph.
Cloud market leader Amazon Web Services' NoSQL DynamoDB offers a Titan plug-in for graphs.
DataStax and IBM recently announced commercial products built on TinkerPop, which attained top-level project status in May with the Apache Software Foundation.
TinkerPop is an open source graph computing framework for both real-time, transactional graph databases (OLTP) and batch analytic graph processors (OLAP). It can be used for small graphs on a single machine or massive graphs that require a distributed environment.
The project is focused on creating industry standards for graph databases, including a standard language, which it calls Gremlin. Its Gremlin traversal machine, meanwhile, is designed to work across languages.
TinkerPop likely plays a role in the growing interest in graphs, Mallette said.
"Without a project like TinkerPop, the graph database world would be quite fragmented. Every graph system would have its own API, its own method for doing queries and no simple methods for integration. That fragmentation would look like a risky technology choice. Imagine what the relational database market would look like without JDBC [the API in Java for accessing a database]," he said. "TinkerPop alleviates that risk by unifying the APIs for interacting with a graph system, making it possible to avoid vendor lock-in and lower the learning curve across all graphs."
Susan Hall has been a journalist for more than 20 years at news outlets including the Seattle Post-Intelligencer, Dallas Times Herald and MSNBC.com. She writes for The New Stack and FierceHealthIT, among other publications.