Saturday, May 23, 2015

GraphX: Graph Computing for Spark

Overview

I've been reading about GraphX, Spark's graph processing library. GraphX provides distributed, in-memory graph computing. What differentiates it from other large-scale graph processing systems, such as Giraph and GraphLab, is that it is tightly integrated into the Spark ecosystem. This allows efficient data pipelines that combine ETL (SQL), machine learning, and graph analysis within one framework (Spark), without the overhead of running multiple systems and copying data between them.

The Spark stack.

Graph Library for the Spark Framework

Graphs in GraphX are directed property multigraphs: each vertex and each edge can have properties (attributes) associated with it, and several parallel edges may connect the same pair of vertices. GraphX graphs are distributed and immutable. You create a graph in GraphX by providing an RDD of vertices and an RDD of edges. You can then perform OLAP operations on the graph through the API. A Pregel API supports vertex-centric, bulk-synchronous parallel, iterative algorithms.
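To make the vertex-centric, bulk-synchronous model more concrete, here is a minimal sketch in plain Python. It is not GraphX's API (GraphX itself is used from Scala on top of RDDs); the plain lists stand in for the vertex and edge RDDs, and the function name `pregel_shortest_paths` and the sample data are made up for illustration. It computes single-source shortest paths, a common example of this style of algorithm.

```python
from collections import defaultdict

# Vertex collection: (vertex id, property). Stands in for the vertex RDD.
vertices = [(1, "A"), (2, "B"), (3, "C"), (4, "D")]

# Edge collection: (source id, destination id, property). Stands in for the edge RDD.
# Two parallel edges from 1 to 2 are allowed because the graph is a multigraph.
edges = [(1, 2, 7.0), (1, 2, 3.0), (2, 3, 1.0), (1, 3, 10.0), (3, 4, 2.0)]


def pregel_shortest_paths(vertices, edges, source, max_supersteps=20):
    """Single-source shortest paths as a vertex-centric, bulk-synchronous loop."""
    inf = float("inf")
    dist = {vid: inf for vid, _ in vertices}
    out_edges = defaultdict(list)
    for src, dst, weight in edges:
        out_edges[src].append((dst, weight))

    # Superstep 0: only the source vertex has a message (distance 0).
    inbox = {source: 0.0}
    for _ in range(max_supersteps):
        if not inbox:
            break  # no messages in flight, so the computation has converged
        outbox = {}
        for vid, msg in inbox.items():
            if msg < dist[vid]:
                dist[vid] = msg  # vertex program: keep the smaller distance
                for dst, weight in out_edges[vid]:
                    candidate = msg + weight
                    # merge step: combine messages to the same vertex with min
                    outbox[dst] = min(outbox.get(dst, inf), candidate)
        # Barrier: messages sent in this superstep are delivered in the next one.
        inbox = outbox
    return dist


distances = pregel_shortest_paths(vertices, edges, source=1)
print(distances)
```

Each pass of the loop is one superstep: vertices that received a message update their own state, send messages along their outgoing edges, and everything waits at the barrier before the next round. A vertex that receives no message does no work, which is what makes the approach efficient once most of the graph has settled.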

In-memory indexes speed up graph operations. GraphX partitions a graph by its edges rather than its vertices, so each edge lives in exactly one partition while a vertex can be split across several. Together with replicating vertex data to the partitions that need it, this speeds up edge traversal, which usually involves communication across machines. A 2014 research paper shows performance comparable to other graph systems, Giraph and GraphLab.
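The effect of edge partitioning on vertices can be sketched in a few lines of plain Python. This is only an illustration of the idea, not GraphX code: the function `partition_edges` and its placement rule (a simple arithmetic hash of both endpoints) are assumptions chosen for the example, not one of GraphX's partitioning strategies.

```python
from collections import defaultdict


def partition_edges(edges, num_partitions):
    """Assign each edge to one partition and record where each vertex is needed."""
    partitions = defaultdict(list)   # partition id -> edges stored there
    replicas = defaultdict(set)      # vertex id -> partitions holding a copy
    for src, dst in edges:
        # Hypothetical placement rule: a deterministic function of both endpoints.
        pid = (src * 31 + dst) % num_partitions
        partitions[pid].append((src, dst))
        # Both endpoints must be available wherever the edge is stored.
        replicas[src].add(pid)
        replicas[dst].add(pid)
    return partitions, replicas


edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)]
partitions, replicas = partition_edges(edges, num_partitions=2)

# Average number of partitions holding a copy of each vertex.
replication_factor = sum(len(p) for p in replicas.values()) / len(replicas)
print(dict(partitions), dict(replicas), replication_factor)
```

Every edge ends up in exactly one partition, but a vertex whose edges land in different partitions is copied to each of them. The replication factor is the cost of that choice: the lower it is, the less vertex data has to be shipped between machines when properties change.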

GraphX is built on RDDs.

Applications

A couple of recent MLlib algorithms are implemented on GraphX: LDA topic modeling and Power Iteration Clustering. Alibaba Taobao uses GraphX for data mining in e-commerce, modeling user-item-merchant interactions as a graph. Netflix uses GraphX for movie recommendation, with graph diffusion and LDA clustering algorithms.