Friday, October 19, 2012

Giraph for Large Multigraphs

I recently looked into using Giraph for algorithms on large multigraphs, graphs that allow multiple edges between two nodes. It turns out that Giraph is indeed a great framework for distributed graph processing on Hadoop, that often gives huge speed-ups compared to MapReduce and is open source. However, as it is a relatively new Apache incubator project (since February), there are certain features that are still missing, and some existing ones that could be better documented.

Giraph Overview

Giraph is a distributed graph processing framework that runs on Hadoop and is designed to run algorithms on really large graphs. But not too large -- Giraph assumes that your entire graph can fit within the total memory of your cluster, by splitting the graph into partitions across the cluster. In practice, this is a reasonable assumption: for example, Giraph is used at Facebook for its social graph.

Giraph's programming model closely follows Pregel (Google), where you "think like a vertex" when implementing an algorithm. It uses the Bulk Synchronous Parallel (BSP) model for distributed synchronization; an algorithm consists of a series of supersteps separated by synchronization barriers.

Giraph is an alternative to MapReduce, currently implemented as one giant Hadoop Map phase. By keeping the entire graph in memory, Giraph avoids the expensive disk reads and writes associated with each MapReduce iteration that occurs if the same algorithm were implemented in normal MapReduce. Instead of a MapReduce shuffle, at each iteration vertices message each other over the network via Netty. Therefore, algorithms with many iterations, such as PageRank, will achieve the most performance gain.

For more details, please see the Giraph talk (video) from Hadoop Summit 2012.

What's Missing from Giraph

Multigraph Support

Giraph does not fully support multigraphs; GIRAPH-141 is the relevant Jira issue. However, there is a recommended workaround that is discussed on a Giraph wiki page. It involves tracking multiple edges between two vertices by using an ArrayListWritable class as the Edge and Message types.

List of Edges, RDF as Input

One input format that Giraph does not support is a list of edges. Unlike the adjacency list format which has one vertex and all its out edges per line, in this case there is one edge per line. That is, the information for one vertex is split across multiple lines. RDF is an example of this format. A workaround is discussed in GIRAPH-170, which recommends preprocessing the input with Pig or MapReduce, to convert the input into a format that is supported.

Data Locality

To get the best performance on a really large graph, it is desirable to put the worker processes on the same node as the data that they will compute on, a central idea in Hadoop MapReduce. This data locality is not completely implemented in Giraph. GIRAPH-275 discusses this:
This is "poor man's locality" since we have no control over where Hadoop puts our workers. In Hadoop, they can go where the data is given the location info we have to work with. In Giraph, the workers go wherever Hadoop puts them and we are just hoping some data blocks are also local. 
The ultimate locality fix will involve major overhauls to how Giraph gets InputFormat/RecordReader and does task submission but might be too much until we're on YARN (or not worth it until we're on YARN.)
The patch from GIRAPH-275 (to be released in 0.2.0) does alleviate this problem somewhat.