Showing posts with label Giraph. Show all posts
Showing posts with label Giraph. Show all posts

Saturday, May 23, 2015

GraphX: Graph Computing for Spark

Overview

I've been reading about GraphX, Spark's graph processing library. GraphX provides distributed, in-memory graph computing. The key thing that differentiates it from other large-scale graph processing sytems, like Giraph and GraphLab, is that it is tightly integrated within the Spark ecosystem. This allows efficient data pipelines that combine ETL (SQL), machine learning, and graph analysis within one framework (Spark), without the overhead of running multiple systems and copying data between them.

The Spark stack.

Graph Library for the Spark Framework

Graphs in GraphX are directed multigraph property graphs, which means that each vertex and each edge can have properties (attributes) associated with it. GraphX graphs are distributed and immutable. You create a graph in GraphX by providing an RDD of vertices and an RDD of edges. You can then perform OLAP operations on a graph through the API. A pregel API supports vertex-centric, bulk-synchronous parallel, iterative algorithms.

In-memory indexes speed up graph operations. Edge partitioning (which means vertices can be split across partitions) and vertex data replication speed up edge traversal, which usually involves communication across machines. A 2014 research paper shows performance comparable to other graph systems, Giraph and GraphLab.

GraphX
GraphX is built on RDDs.

Applications

A couple of recent MLLib algorithms are implemented on GraphX: LDA topic modeling and Power Iteration Clustering. Alibaba Taobao uses GraphX for data mining in ecommerce, modeling user-item-merchant interactions as a graph. Netflix uses GraphX for movie recommendation, with graph diffusion and LDA clustering algorithms.


Monday, December 10, 2012

Scalable Graph Processing: Comparing Giraph, Titan, Faunus

A couple of open-source, scalable graph processing frameworks were released this year: Giraph, Titan, and Faunus. All of them support distributed processing of graphs across a cluster of multiple machines. What then is the difference between them? Actually, there are huge differences related to the types of queries they support, as well as how graphs are stored and their APIs.

Titan, for concurrent transaction processing (OLTP) 

Titan is optimized for handling thousands of concurrent transactions against massive, dynamic, complex graphs. As such, it really fills a niche in the distributed graph landscape. Titan works best with multiple short, local queries, that start at one vertex and query the local neighborhood around it, as opposed to global queries against an entire graph. An example use case would be a graph representation of a social network like Twitter, where a node represented either a user or a tweet, and edges linked a user to the users and tweets that they followed. In this case, the concurrent transactions would include reads, such as a user reading their tweet stream, as well as writes: any new tweet or user, or a new "follow" edge. In fact, this experiment by Aurelius shows how Titan achieves up to ten thousand transactions per second on a six-node EC2 cluster.

Titan's modular design: Cassandra/HBase and Gremlin

One big plus for Titan is instead of reinventing the wheel, it builds upon existing open-source projects. It cleanly abstracts out the distributed data store layer and thus can leverage either Cassandra or HBase for the underlying storage backend. On the application side, Titan implements the existing Tinkerpop Blueprints graph API, so that any existing libraries built to Blueprints will run on Titan. In particular, this means that one can quickly get started programming Titan through Gremlin, a popular graph DSL built on Groovy.



Giraph, for batch processing (OLAP)

Giraph, on the other hand, is optimized for batch processing of massive graphs. It was designed to speed up global graph analyses, especially iterative algorithms - like PageRank - that start from a set of vertices and take a series of adjacent hops out from those starting points, potentially touching all vertices in the graph. An example use case would be a graph representation of a social network, where an analysis might involve ranking all the people in the network based on their popularity with a link-based algorithm like PageRank.

Giraph keeps the entire graph in memory; Java API

One of Giraph's main selling points is that it loads an entire graph into a cluster's memory, that is, it splits up a graph into partitions and spreads those partitions across the memory of nodes in a cluster, and keeps them there throughout the duration of an analysis job. Therefore, iterative algorithms that constantly reference graph data run faster by not having to access them from disk. Of course, the tradeoff is that the size of the graph that can be processed by Giraph is limited by the total memory in a cluster. Giraph applications are programmed through a Java API. My previous blog post covers more details about Giraph.

Faunus, for batch processing (OLAP)

Faunus, a companion project to Titan, is more comparable to Giraph, in that it was also designed for batch queries. What Faunus does best is batch analytics on massive graphs. One example would be calculating the vertex degree distribution of a graph, which would involve scanning through all its vertices to get their degree and then counting up the degree distribution.

Faunus runs Hadoop MapReduce; Interactive Gremlin shell

Batch analytics in Faunus leverages the parallelism of Hadoop MapReduce. Like Titan, Faunus can be programmed through the interactive Gremlin shell, but, unlike Titan, Faunus converts each Gremlin query into a set of MapReduce jobs. The graph data is stored on disk, distributed across a cluster on HDFS, the Hadoop distributed file system, in a compressed, binary format called SequenceFile format, allowing support for massive graphs. With the MapReduce conversion and interactive shell, Faunus is a lot like Pig, except that it is programmed in Gremlin, a graph-specific DSL.

Friday, October 19, 2012

Giraph for Large Multigraphs

I recently looked into using Giraph for algorithms on large multigraphs, graphs that allow multiple edges between two nodes. It turns out that Giraph is indeed a great framework for distributed graph processing on Hadoop, that often gives huge speed-ups compared to MapReduce and is open source. However, as it is a relatively new Apache incubator project (since February), there are certain features that are still missing, and some existing ones that could be better documented.

Giraph Overview

Giraph is a distributed graph processing framework that runs on Hadoop and is designed to run algorithms on really large graphs. But not too large -- Giraph assumes that your entire graph can fit within the total memory of your cluster, by splitting the graph into partitions across the cluster. In practice, this is a reasonable assumption: for example, Giraph is used at Facebook for its social graph.

Giraph's programming model closely follows Pregel (Google), where you "think like a vertex" when implementing an algorithm. It uses the Bulk Synchronous Parallel (BSP) model for distributed synchronization; an algorithm consists of a series of supersteps separated by synchronization barriers.

Giraph is an alternative to MapReduce, currently implemented as one giant Hadoop Map phase. By keeping the entire graph in memory, Giraph avoids the expensive disk reads and writes associated with each MapReduce iteration that occurs if the same algorithm were implemented in normal MapReduce. Instead of a MapReduce shuffle, at each iteration vertices message each other over the network via Netty. Therefore, algorithms with many iterations, such as PageRank, will achieve the most performance gain.

For more details, please see the Giraph talk (video) from Hadoop Summit 2012.

What's Missing from Giraph

Multigraph Support

Giraph does not fully support multigraphs; GIRAPH-141 is the relevant Jira issue. However, there is a recommended workaround that is discussed on a Giraph wiki page. It involves tracking multiple edges between two vertices by using an ArrayListWritable class as the Edge and Message types.

List of Edges, RDF as Input

One input format that Giraph does not support is a list of edges. Unlike the adjacency list format which has one vertex and all its out edges per line, in this case there is one edge per line. That is, the information for one vertex is split across multiple lines. RDF is an example of this format. A workaround is discussed in GIRAPH-170, which recommends preprocessing the input with Pig or MapReduce, to convert the input into a format that is supported.

Data Locality

To get the best performance on a really large graph, it is desirable to put the worker processes on the same node as the data that they will compute on, a central idea in Hadoop MapReduce. This data locality is not completely implemented in Giraph. GIRAPH-275 discusses this:
This is "poor man's locality" since we have no control over where Hadoop puts our workers. In Hadoop, they can go where the data is given the location info we have to work with. In Giraph, the workers go wherever Hadoop puts them and we are just hoping some data blocks are also local. 
The ultimate locality fix will involve major overhauls to how Giraph gets InputFormat/RecordReader and does task submission but might be too much until we're on YARN (or not worth it until we're on YARN.)
The patch from GIRAPH-275 (to be released in 0.2.0) does alleviate this problem somewhat.