Monday, December 10, 2012

Scalable Graph Processing: Comparing Giraph, Titan, Faunus

Several open-source, scalable graph processing frameworks were released this year: Giraph, Titan, and Faunus. All of them support distributed processing of graphs across a cluster of machines. So what is the difference between them? Actually, there are huge differences in the types of queries they support, how they store graphs, and their APIs.

Titan, for concurrent transaction processing (OLTP) 

Titan is optimized for handling thousands of concurrent transactions against massive, dynamic, complex graphs. As such, it really fills a niche in the distributed graph landscape. Titan works best with many short, local queries that start at one vertex and query the local neighborhood around it, as opposed to global queries against an entire graph. An example use case would be a graph representation of a social network like Twitter, where a vertex represents either a user or a tweet, and edges link a user to the users and tweets that they follow. In this case, the concurrent transactions would include reads, such as a user reading their tweet stream, as well as writes: any new tweet or user, or a new "follow" edge. In fact, this experiment by Aurelius shows Titan achieving up to ten thousand transactions per second on a six-node EC2 cluster.
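To make the access pattern concrete, here is a plain-Java sketch of that kind of OLTP workload: every transaction reads or writes only one vertex's immediate neighborhood, never the whole graph. The class and method names are hypothetical and purely illustrative; Titan's real interface is the Blueprints/Gremlin API discussed below.

```java
import java.util.*;

// Toy in-memory follower graph illustrating the OLTP-style access pattern
// Titan is built for: short reads and writes that touch only one vertex's
// local neighborhood. (Hypothetical names; not Titan's actual API.)
public class FollowGraph {
    private final Map<String, Set<String>> follows = new HashMap<>();

    // Write transaction: add a new "follow" edge from one vertex to another.
    public void follow(String user, String followed) {
        follows.computeIfAbsent(user, k -> new HashSet<>()).add(followed);
    }

    // Read transaction: fetch only the local neighborhood of one vertex.
    public Set<String> following(String user) {
        return follows.getOrDefault(user, Collections.emptySet());
    }

    public static void main(String[] args) {
        FollowGraph g = new FollowGraph();
        g.follow("alice", "bob");
        g.follow("alice", "carol");
        System.out.println(g.following("alice").size()); // prints 2
    }
}
```

Note that neither operation ever scans the full vertex set; that is exactly the property that lets this style of query run concurrently at high volume.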

Titan's modular design: Cassandra/HBase and Gremlin

One big plus for Titan is that, instead of reinventing the wheel, it builds upon existing open-source projects. It cleanly abstracts out the distributed data store layer and thus can leverage either Cassandra or HBase as the underlying storage backend. On the application side, Titan implements the existing TinkerPop Blueprints graph API, so any existing libraries built against Blueprints will run on Titan. In particular, this means that one can quickly get started programming Titan through Gremlin, a popular graph DSL built on Groovy.

Giraph, for batch processing (OLAP)

Giraph, on the other hand, is optimized for batch processing of massive graphs. It was designed to speed up global graph analyses, especially iterative algorithms like PageRank that start from a set of vertices and take a series of adjacent hops out from those starting points, potentially touching every vertex in the graph. An example use case would be a graph representation of a social network, where an analysis might involve ranking all the people in the network by popularity with a link-based algorithm like PageRank.
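Giraph follows the vertex-centric ("think like a vertex") model: in each superstep, every vertex receives messages from its in-neighbors, updates its value, and sends messages along its out-edges. Real Giraph code extends Giraph's Java vertex API; the sketch below is plain self-contained Java that shows only the per-superstep PageRank logic on a tiny hard-coded graph.

```java
import java.util.*;

// Plain-Java sketch of PageRank in the vertex-centric, superstep model
// that Giraph uses. (Illustrative only; not the Giraph API.)
public class PageRankSketch {
    // Adjacency list: vertex id -> out-neighbors.
    static final Map<Integer, List<Integer>> EDGES = Map.of(
        0, List.of(1, 2),
        1, List.of(2),
        2, List.of(0)
    );

    public static Map<Integer, Double> run(int supersteps, double damping) {
        int n = EDGES.size();
        Map<Integer, Double> rank = new HashMap<>();
        for (int v : EDGES.keySet()) rank.put(v, 1.0 / n);

        for (int step = 0; step < supersteps; step++) {
            // "Messages": each vertex sends rank/outDegree along its out-edges.
            Map<Integer, Double> incoming = new HashMap<>();
            for (var e : EDGES.entrySet()) {
                double share = rank.get(e.getKey()) / e.getValue().size();
                for (int dst : e.getValue())
                    incoming.merge(dst, share, Double::sum);
            }
            // Each vertex updates its value from the messages it received.
            for (int v : EDGES.keySet())
                rank.put(v, (1 - damping) / n
                        + damping * incoming.getOrDefault(v, 0.0));
        }
        return rank;
    }

    public static void main(String[] args) {
        System.out.println(run(30, 0.85));
    }
}
```

In Giraph, the message passing and superstep barriers are handled by the framework across the cluster; only the per-vertex update is written by the programmer.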

Giraph keeps the entire graph in memory; Java API

One of Giraph's main selling points is that it loads an entire graph into a cluster's memory: it splits a graph into partitions, spreads those partitions across the memory of the nodes in a cluster, and keeps them there for the duration of an analysis job. Iterative algorithms that constantly reference graph data therefore run faster because they never have to fetch it from disk. Of course, the tradeoff is that the size of the graph Giraph can process is limited by the total memory in the cluster. Giraph applications are programmed through a Java API. My previous blog post covers more details about Giraph.
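The partitioning itself is conceptually simple; by default it is hash-based. Here is a minimal sketch of how vertex ids get assigned to workers' memory (the worker count and hashing scheme here are illustrative, not Giraph's exact internals):

```java
import java.util.*;

// Sketch of hash partitioning: each vertex id maps to one worker, so the
// graph ends up split into partitions held in memory across the cluster.
public class HashPartitioner {
    public static int workerFor(long vertexId, int numWorkers) {
        // floorMod keeps the result non-negative even for negative hashes.
        return Math.floorMod(Long.hashCode(vertexId), numWorkers);
    }

    public static void main(String[] args) {
        int numWorkers = 4;
        Map<Integer, List<Long>> partitions = new HashMap<>();
        for (long v = 0; v < 10; v++)
            partitions.computeIfAbsent(workerFor(v, numWorkers),
                    k -> new ArrayList<>()).add(v);
        System.out.println(partitions);
    }
}
```

Because the assignment is deterministic, any worker can compute where a message's destination vertex lives without consulting a central index.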

Faunus, for batch processing (OLAP)

Faunus, a companion project to Titan, is more comparable to Giraph, in that it was also designed for batch queries. What Faunus does best is batch analytics on massive graphs. One example would be calculating the vertex degree distribution of a graph, which involves scanning through all its vertices to get their degrees and then tallying up the distribution.

Faunus runs Hadoop MapReduce; Interactive Gremlin shell

Batch analytics in Faunus leverages the parallelism of Hadoop MapReduce. Like Titan, Faunus can be programmed through the interactive Gremlin shell, but, unlike Titan, Faunus converts each Gremlin query into a set of MapReduce jobs. The graph data is stored on disk, distributed across the cluster on HDFS (the Hadoop distributed file system) in Hadoop's compressed, binary SequenceFile format, which is what lets Faunus handle massive graphs. With the MapReduce conversion and interactive shell, Faunus is a lot like Pig, except that it is programmed in Gremlin, a graph-specific DSL.
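The degree-distribution example above maps naturally onto two MapReduce phases: the map phase emits each vertex's degree, and the reduce phase counts how many vertices share each degree. As a plain-Java sketch of that two-phase computation (not the Faunus API, which would express it as a one-line Gremlin query):

```java
import java.util.*;
import java.util.stream.*;

// Map/reduce-style sketch of a vertex degree distribution, the kind of
// global scan Faunus compiles Gremlin queries into. (Plain Java only.)
public class DegreeDistribution {
    public static Map<Integer, Long> distribution(
            Map<String, List<String>> adjacency) {
        return adjacency.values().stream()
                // Map phase: each vertex emits its out-degree.
                .map(List::size)
                // Reduce phase: count the vertices sharing each degree.
                .collect(Collectors.groupingBy(d -> d, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("c"),
            "c", List.of()
        );
        // One vertex each of degree 0, 1, and 2.
        System.out.println(distribution(g));
    }
}
```

In Faunus the same scan runs over SequenceFiles on HDFS, so the map phase parallelizes across however many nodes hold partitions of the graph.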