Why Open Source Graph Databases Are Catching On
Open source graph databases, first used by social networks like Facebook and Twitter, are seeing mainstream adoption.
Graph databases, which use graph structures for semantic queries, came into prominence through social networks like Facebook and Twitter. But they're now used for far more than mapping connections between friends and relatives. Graph databases give organizations the capability to analyze and understand vast graphs of connected data.
Open Source Graph Databases
Open source graph databases are proving especially popular, as companies increasingly shun proprietary software and vendor lock-in for data management and storage. Open source also gives software developers more flexibility and makes it easier to control up-front costs.
The major social networks all use open source graph technology. Twitter created the open source FlockDB for managing wide but shallow network graphs. Google's Cayley was inspired by the graph database behind Freebase and its Knowledge Graph, the knowledge base behind its search engine. Facebook uses Apache Giraph, a graph processing system built for high scalability.
"Remember back to Alta Vista before Google? Alta Vista was good, but Google was so much better because it actually understood how all the pages on the web linked together," said Quinn Slack, co-founder and CEO of Sourcegraph, which uses a massive open source graph database in its product, a search engine for open source development code.
In this article, we cover:
- What graph databases can do
- How graph databases work, and why they are better suited for certain data tasks than relational or NoSQL databases
- Some companies that are using graph databases, and how they are using them
- The most common use cases for graph databases
- Graph database strengths and weaknesses
- The size of the graph database market and its most prominent vendors
- How companies like Oracle and Microsoft are responding to the graph database trend
- How the open source Apache TinkerPop project is influencing graph database adoption
Graph databases not only find connections between different points of data, but they also can rank the relevance or weight of those relationships.
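As a rough illustration of ranking relationships by weight, the Python sketch below stores each connection with a numeric strength and returns a node's strongest connections first. The names, weights and helper functions are all hypothetical; a real graph database would expose this through its own query language rather than hand-written code.

```python
from collections import defaultdict

# Hypothetical weighted relationships: (source, target, weight).
# A higher weight stands for a stronger or more relevant connection.
edges = [
    ("alice", "bob", 0.9),
    ("alice", "carol", 0.4),
    ("alice", "dave", 0.7),
    ("bob", "carol", 0.2),
]


def build_adjacency(edge_list):
    """Group outgoing (target, weight) pairs by source node."""
    adjacency = defaultdict(list)
    for source, target, weight in edge_list:
        adjacency[source].append((target, weight))
    return adjacency


def strongest_connections(adjacency, node, limit=2):
    """Return up to `limit` neighbors of `node`, strongest first."""
    neighbors = adjacency.get(node, [])
    return sorted(neighbors, key=lambda pair: pair[1], reverse=True)[:limit]


adjacency = build_adjacency(edges)
top = strongest_connections(adjacency, "alice")
print(top)
```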
"Graphs represent a natural way to model data as they tend to allow it to be directly stored in the manner that we think and reason about it in the real world," said Stephen Mallette, vice president of the Apache TinkerPop open source graph database project and a software engineer at DataStax, a company that develops and provides commercial support for an enterprise edition of the NoSQL Cassandra database.
"The value proposition for graphs has spread past the innovators and early adopters at this point," he said.
Relational databases perform poorly at ferreting out relationships. They require the data to be modeled at the start, with tables joined through foreign keys. Each additional join degrades performance sharply, which can make relational databases untenable, especially for online applications. The up-front modeling also limits the flexibility to change with business needs.
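To make the join cost concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `person` and `friendship` tables and their rows are invented for illustration. Finding friends of friends already needs the friendship table joined to itself, and each further hop would add another join.

```python
import sqlite3

# In-memory database with a hypothetical schema and sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE friendship (
    person_id INTEGER NOT NULL REFERENCES person(id),
    friend_id INTEGER NOT NULL REFERENCES person(id)
);
INSERT INTO person VALUES (1, 'Ann'), (2, 'Ben'), (3, 'Cho'), (4, 'Dee');
INSERT INTO friendship VALUES (1, 2), (2, 3), (3, 4);
""")

# Two hops (friends of friends) require a self-join on friendship,
# plus a join back to person to recover the name.
friends_of_friends = conn.execute(
    """
    SELECT p.name
    FROM friendship AS f1
    JOIN friendship AS f2 ON f2.person_id = f1.friend_id
    JOIN person AS p ON p.id = f2.friend_id
    WHERE f1.person_id = ?
    """,
    (1,),
).fetchall()

print(friends_of_friends)
conn.close()
```

A three-hop query would need a third copy of `friendship` in the join chain, which is the pattern that becomes expensive as the data and the depth of the question grow.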
Most NoSQL databases -- whether key-value, document or column-oriented -- also struggle to handle connected data and graphs, according to an ebook written by executives from Neo4j. The company introduced its open source graph database in 2010 and, like many vendors with products built on open source technology, now offers both community and enterprise editions of the database.
While connecting that data can be done, it becomes increasingly difficult as companies move beyond modestly sized operations. Twitter and Facebook, by comparison, deal with billions of relationships.
Rather than requiring applications to create a network out of disconnected data, graph databases store connected data as connected data.
They store data in nodes, also called vertices, that represent individual entities -- a person, product, piece of data -- and store the relationships between those entities as edges. One node might hold a product name while another might hold a vendor name, with the relationship between them indicating that the vendor supplies that product.
Some definitions require index-free adjacency, meaning that connected nodes physically "point" to each other in the database.
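The vendor and product example, together with the idea of connected nodes pointing at each other, can be sketched in a few lines of Python. This is only an in-memory illustration under assumed names (`Node`, `Edge`, `connect`, `neighbors`, "Acme Supply", "Widget"); it is not how any particular graph database lays out its storage.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                 # kind of entity, e.g. "Vendor"
    properties: dict           # data held by the node
    outgoing: list = field(default_factory=list)  # Edge objects


@dataclass
class Edge:
    kind: str                  # relationship type, e.g. "SUPPLIES"
    target: Node               # direct reference to the connected node


def connect(source, kind, target):
    """Attach a relationship directly to the source node."""
    source.outgoing.append(Edge(kind, target))


def neighbors(node, kind):
    """Follow references from `node`; no global index lookup is needed."""
    return [edge.target for edge in node.outgoing if edge.kind == kind]


vendor = Node("Vendor", {"name": "Acme Supply"})
product = Node("Product", {"name": "Widget"})
connect(vendor, "SUPPLIES", product)

supplied = [n.properties["name"] for n in neighbors(vendor, "SUPPLIES")]
print(supplied)
```

Because each node carries references to its own relationships, the cost of finding a node's neighbors depends on how many relationships that node has rather than on the total size of the graph.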
As a result, a single query can traverse many relationships. "For example, we can ask the graph to find for us all the flavors of ice cream liked by people who enjoy espresso but dislike Brussels sprouts, and who live in a particular neighborhood," the Neo4j authors point out in their ebook.
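The ice cream question can be sketched in plain Python over a small set of relationships. All of the people, neighborhoods, flavors and relationship names below are invented sample data, and a graph database would express this in its own query language; the point is only that the answer comes from following relationships.

```python
from collections import defaultdict

# Hypothetical relationships as (source, kind, target) triples.
edges = [
    ("Ana", "LIVES_IN", "Riverside"),
    ("Ana", "LIKES", "espresso"),
    ("Ana", "DISLIKES", "Brussels sprouts"),
    ("Ana", "LIKES", "pistachio"),
    ("Raj", "LIVES_IN", "Riverside"),
    ("Raj", "LIKES", "espresso"),
    ("Raj", "LIKES", "Brussels sprouts"),
    ("Raj", "LIKES", "vanilla"),
    ("Mei", "LIVES_IN", "Hilltop"),
    ("Mei", "LIKES", "espresso"),
    ("Mei", "DISLIKES", "Brussels sprouts"),
    ("Mei", "LIKES", "chocolate"),
    ("pistachio", "IS_A", "ice cream flavor"),
    ("vanilla", "IS_A", "ice cream flavor"),
    ("chocolate", "IS_A", "ice cream flavor"),
]

# Index relationships in both directions so they can be followed either way.
outgoing = defaultdict(set)
incoming = defaultdict(set)
for source, kind, target in edges:
    outgoing[(source, kind)].add(target)
    incoming[(kind, target)].add(source)


def flavors_for(neighborhood):
    """Flavors liked by espresso fans who dislike sprouts and live nearby."""
    people = (
        incoming[("LIVES_IN", neighborhood)]
        & incoming[("LIKES", "espresso")]
        & incoming[("DISLIKES", "Brussels sprouts")]
    )
    flavors = incoming[("IS_A", "ice cream flavor")]
    liked = set()
    for person in people:
        liked |= outgoing[(person, "LIKES")] & flavors
    return sorted(liked)


print(flavors_for("Riverside"))
```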
"Whether we want to understand relationships between customers, elements in a telephone or data center network, entertainment producers and consumers, or genes and proteins, the ability to understand and analyze vast graphs of highly connected data will be key in determining which companies outperform their competitors over the coming decade," they state.
Social networks aren't the only companies using graph databases, of course. Here are some other notable examples of initiatives in which graph databases play a key role:
- Graph database technology helped the International Consortium of Investigative Journalists trace connections in the Panama Papers, the leak of 2.6TB of data about hidden offshore accounts from Panamanian law firm Mossack Fonseca. Those connections included couples at the same address who were not married, bank accounts used for money laundering and emails between various people not necessarily named on the accounts.
- Montefiore Medical Center in New York built a data lake based on graph database technology and has teamed up with Mayo Clinic on an algorithm that uses various physical indicators to predict when patients are likely to have a major adverse event within 48 hours.
- Walmart uses graph technology to generate product recommendations for its online retail operations.