Overview
I've been reading about GraphX, Spark's graph processing library. GraphX provides distributed, in-memory graph computing. What differentiates it from other large-scale graph processing systems, such as Giraph and GraphLab, is its tight integration with the Spark ecosystem. That integration allows efficient data pipelines that combine ETL (SQL), machine learning, and graph analysis within one framework (Spark), without the overhead of running multiple systems and copying data between them.
[Figure: The Spark stack.]
Graph Library for the Spark Framework
Graphs in GraphX are directed property multigraphs: several parallel edges may connect the same pair of vertices, and each vertex and each edge can have properties (attributes) associated with it. GraphX graphs are distributed and immutable. You create a graph in GraphX by providing an RDD of vertices and an RDD of edges. You can then perform OLAP operations on a graph through the API. A Pregel API supports vertex-centric, bulk-synchronous parallel, iterative algorithms.
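To make the vertex-centric, bulk-synchronous model concrete, here is a minimal single-machine sketch in plain Python. It is not GraphX code and does not use Spark; the function `pregel` and its parameter names (`vprog`, `send_msg`, `merge_msg`) are my own illustrative choices, and the tiny graph is made up. It computes single-source shortest paths on a small property multigraph built from a vertex list and an edge list.

```python
import math

# Property multigraph: vertices carry a label, edges carry a weight.
# Note the two parallel edges from vertex 1 to vertex 2.
vertices = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
edges = [(1, 2, 1.0), (1, 2, 5.0), (2, 3, 2.0), (1, 3, 7.0), (3, 4, 1.0)]


def pregel(state, edges, initial_msg, vprog, send_msg, merge_msg, max_supersteps=20):
    """Hypothetical single-machine model of a vertex-centric BSP loop."""
    # Superstep 0: every vertex processes the initial message.
    state = {vid: vprog(vid, value, initial_msg) for vid, value in state.items()}
    active = set(state)
    for _ in range(max_supersteps):
        # Message phase: scan edges whose source was updated in the last superstep.
        inbox = {}
        for src, dst, attr in edges:
            if src not in active:
                continue
            for target, msg in send_msg(src, state[src], dst, state[dst], attr):
                # Combine messages addressed to the same vertex.
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break  # no messages in flight: the computation has converged
        # Barrier: only after all messages are collected do vertices update.
        for vid, msg in inbox.items():
            state[vid] = vprog(vid, state[vid], msg)
        active = set(inbox)
    return state


def vprog(vid, dist, msg):
    # Vertex program: keep the shortest distance seen so far.
    return min(dist, msg)


def send_msg(src, src_dist, dst, dst_dist, weight):
    # Send an offer along the edge only if it would improve the destination.
    candidate = src_dist + weight
    return [(dst, candidate)] if candidate < dst_dist else []


source = 1
initial = {vid: (0.0 if vid == source else math.inf) for vid, _label in vertices}
distances = pregel(initial, edges, math.inf, vprog, send_msg, min)
print(distances)
```

Each pass through the loop is one superstep: messages are generated from the current state, merged per destination, and applied only after the whole edge scan finishes. In a distributed system that barrier is where communication between machines happens; here it is just the boundary between the two inner loops.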
In-memory indexes speed up graph operations. GraphX partitions graphs by edge: each edge lives in one partition, so a vertex's edges (and therefore the vertex itself) can be split across partitions. Together with replication of vertex data, this speeds up edge traversal, which usually involves communication across machines. A 2014 research paper reports performance comparable to that of two other graph systems, Giraph and GraphLab.
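The effect of edge partitioning on vertices can be illustrated with a small plain-Python sketch. The assignment rule `partition_of` below is a toy rule I made up for the example, not one of GraphX's actual partitioning strategies, and the edge list is invented. The sketch assigns each edge to exactly one partition and then counts how many partitions each vertex has to be replicated to.

```python
from collections import defaultdict

# Toy directed graph as (src, dst) pairs; vertex 1 has many edges.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 1)]
num_partitions = 2


def partition_of(src, dst, num_partitions):
    # Hypothetical assignment rule, for illustration only.
    return (src + dst) % num_partitions


# Edge partitioning: every edge is stored in exactly one partition.
edge_partitions = defaultdict(list)
for src, dst in edges:
    edge_partitions[partition_of(src, dst, num_partitions)].append((src, dst))

# A vertex needs a copy of its data in every partition holding one of its edges.
replicas = defaultdict(set)
for pid, part in edge_partitions.items():
    for src, dst in part:
        replicas[src].add(pid)
        replicas[dst].add(pid)

# Average number of copies per vertex; 1.0 would mean no vertex is split.
replication_factor = sum(len(pids) for pids in replicas.values()) / len(replicas)

for vid in sorted(replicas):
    print(vid, sorted(replicas[vid]))
print("replication factor:", replication_factor)
```

The trade-off this exposes: a lower replication factor means fewer vertex copies to keep in sync across machines, while spreading a high-degree vertex's edges over several partitions balances the work of traversing them.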
[Figure: GraphX is built on RDDs.]
Applications
A couple of recent MLlib algorithms are implemented on GraphX: LDA topic modeling and Power Iteration Clustering. Alibaba Taobao uses GraphX for data mining in e-commerce, modeling user-item-merchant interactions as a graph. Netflix uses GraphX for movie recommendation, with graph diffusion and LDA clustering algorithms.