Monday, December 10, 2012

Scalable Graph Processing: Comparing Giraph, Titan, Faunus

Three open-source, scalable graph processing frameworks were released this year: Giraph, Titan, and Faunus. All of them support distributed processing of graphs across a cluster of machines. So what is the difference between them? Quite a lot, actually: they differ in the types of queries they support, in how they store graphs, and in their APIs.

Titan, for concurrent transaction processing (OLTP) 

Titan is optimized for handling thousands of concurrent transactions against massive, dynamic, complex graphs. As such, it fills a real niche in the distributed graph landscape. Titan works best with many short, local queries that start at one vertex and explore the neighborhood around it, as opposed to global queries that touch an entire graph. An example use case would be a graph representation of a social network like Twitter, where a vertex represents either a user or a tweet, and edges link a user to the users and tweets they follow. Here the concurrent transactions would include reads, such as a user loading their tweet stream, as well as writes: any new tweet or user, or a new "follow" edge. In fact, this experiment by Aurelius shows how Titan achieves up to ten thousand transactions per second on a six-node EC2 cluster.
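
To make this concrete, here is a minimal sketch of such a short, local query, written in Java against the Blueprints graph API that Titan implements (more on Blueprints in the next section). The "name" property key and "follows" edge label are hypothetical stand-ins for whatever schema you actually use:

    import com.tinkerpop.blueprints.Direction;
    import com.tinkerpop.blueprints.Graph;
    import com.tinkerpop.blueprints.Vertex;

    public class LocalQuery {
        // Start at one vertex and touch only its local neighborhood:
        // look up a user by name and list the users/tweets they follow.
        public static void printFollowed(Graph graph) {
            Vertex alice = graph.getVertices("name", "alice").iterator().next();
            for (Vertex followed : alice.getVertices(Direction.OUT, "follows")) {
                Object name = followed.getProperty("name");
                System.out.println(name);
            }
        }
    }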

Titan's modular design: Cassandra/HBase and Gremlin

One big plus for Titan is that, instead of reinventing the wheel, it builds on existing open-source projects. It cleanly abstracts out the distributed data store layer, so it can leverage either Cassandra or HBase as the underlying storage backend. On the application side, Titan implements the existing TinkerPop Blueprints graph API, so any existing library built against Blueprints will run on Titan. In particular, this means you can quickly get started programming Titan through Gremlin, a popular graph DSL built on Groovy.
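
For a flavor of what this modularity looks like in practice, here is a hedged sketch of opening a Titan graph backed by Cassandra and writing a few elements through plain Blueprints calls. The hostname and property values are placeholders, and the exact transaction API varies a bit across Titan and Blueprints versions:

    import com.thinkaurelius.titan.core.TitanFactory;
    import com.thinkaurelius.titan.core.TitanGraph;
    import com.tinkerpop.blueprints.TransactionalGraph;
    import com.tinkerpop.blueprints.Vertex;
    import org.apache.commons.configuration.BaseConfiguration;

    public class TitanSetup {
        public static void main(String[] args) {
            // Point Titan at a Cassandra storage backend.
            BaseConfiguration conf = new BaseConfiguration();
            conf.setProperty("storage.backend", "cassandra");
            conf.setProperty("storage.hostname", "127.0.0.1"); // placeholder

            TitanGraph graph = TitanFactory.open(conf);

            // Ordinary Blueprints calls: Titan implements the Blueprints API.
            Vertex alice = graph.addVertex(null);
            alice.setProperty("name", "alice");
            Vertex bob = graph.addVertex(null);
            bob.setProperty("name", "bob");
            graph.addEdge(null, alice, bob, "follows");

            // Commit the transaction (transaction API varies by version).
            graph.stopTransaction(TransactionalGraph.Conclusion.SUCCESS);
            graph.shutdown();
        }
    }

And because Titan speaks Blueprints, the same graph can also be explored interactively from the Gremlin shell.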



Giraph, for batch processing (OLAP)

Giraph, on the other hand, is optimized for batch processing of massive graphs. It was designed to speed up global graph analyses, especially iterative algorithms - like PageRank - that start from a set of vertices and take a series of adjacent hops out from those starting points, potentially touching all vertices in the graph. An example use case would be a graph representation of a social network, where an analysis might involve ranking all the people in the network based on their popularity with a link-based algorithm like PageRank.

Giraph keeps the entire graph in memory; Java API

One of Giraph's main selling points is that it loads an entire graph into a cluster's memory: it splits the graph into partitions, spreads those partitions across the memory of the cluster's nodes, and keeps them there for the duration of an analysis job. Iterative algorithms that constantly reference graph data therefore run faster, because they never have to fetch that data from disk. The tradeoff, of course, is that the size of graph Giraph can process is limited by the cluster's total memory. Giraph applications are programmed through a Java API. My previous blog post covers more details about Giraph.
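
As a taste of that Java API, here is a condensed PageRank vertex, modeled on the SimplePageRankVertex example that ships with Giraph. Exact class and method names vary across Giraph versions, so treat this as a sketch rather than copy-paste code:

    import java.util.Iterator;
    import org.apache.giraph.graph.Vertex; // location varies by Giraph version
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.LongWritable;

    public class PageRankVertex extends
            Vertex<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

        private static final int MAX_SUPERSTEPS = 30;

        @Override
        public void compute(Iterator<DoubleWritable> messages) {
            if (getSuperstep() >= 1) {
                // Sum the rank contributions sent by in-neighbors
                // during the previous superstep.
                double sum = 0;
                while (messages.hasNext()) {
                    sum += messages.next().get();
                }
                setVertexValue(new DoubleWritable(
                        0.15 / getNumVertices() + 0.85 * sum));
            }
            if (getSuperstep() < MAX_SUPERSTEPS) {
                // Split this vertex's rank evenly among its out-neighbors.
                sendMsgToAllEdges(new DoubleWritable(
                        getVertexValue().get() / getNumOutEdges()));
            } else {
                voteToHalt();
            }
        }
    }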

Faunus, for batch processing (OLAP)

Faunus, a companion project to Titan, is more comparable to Giraph in that it too was designed for batch queries: what Faunus does best is batch analytics on massive graphs. One example would be computing the vertex degree distribution of a graph, which involves scanning all its vertices to get their degrees and then tallying how many vertices have each degree.

Faunus runs Hadoop MapReduce; Interactive Gremlin shell

Batch analytics in Faunus leverages the parallelism of Hadoop MapReduce. Like Titan, Faunus can be programmed through the interactive Gremlin shell, but, unlike Titan, Faunus converts each Gremlin query into a set of MapReduce jobs. The graph data lives on disk, distributed across the cluster in HDFS (the Hadoop distributed file system) as compressed, binary SequenceFiles, which lets Faunus support massive graphs. With the MapReduce conversion and the interactive shell, Faunus is a lot like Pig, except that it is programmed in Gremlin, a graph-specific DSL.
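
Faunus generates those MapReduce jobs from the Gremlin query for you, but it helps to see what the underlying computation looks like. Purely for illustration, here is a hand-written Hadoop sketch of the degree-distribution example, assuming a plain-text adjacency-list input (one vertex id followed by its neighbors per line; this format is a made-up stand-in for Faunus's SequenceFiles):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class DegreeDistribution {
        // Map: for each vertex line, emit (degree, 1).
        public static class DegreeMapper
                extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] tokens = line.toString().trim().split("\\s+");
                int degree = tokens.length - 1; // everything after the vertex id
                ctx.write(new IntWritable(degree), ONE);
            }
        }

        // Reduce: sum the ones to get how many vertices have each degree.
        public static class DegreeReducer
                extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
            @Override
            protected void reduce(IntWritable degree, Iterable<IntWritable> counts,
                    Context ctx) throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable c : counts) {
                    total += c.get();
                }
                ctx.write(degree, new IntWritable(total));
            }
        }
    }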

Saturday, November 10, 2012

Mervin Kelly on Innovation


Long before Jack Dorsey and Steve Jobs reflected on the keys to innovation, Mervin Kelly, one of the early architects of Bell Labs, weighed in on the subject. I learned this from Jon Gertner's "The Idea Factory", a fascinating account of Bell Labs' age of innovation. In particular, Chapter 9, "The Formula", describes a speech that Kelly gave to the Royal Society in London in 1950, shortly after the invention of the transistor at Bell Labs, which highlighted his thoughts on innovation. At the time, Bell Labs was widely considered the most innovative industrial research lab in the world. After years as a physicist and then director of research, Kelly was by then an executive vice president at Bell Labs.

Interestingly, the formula for innovation does not seem to have changed much over the years. Kelly considered these three ingredients essential to innovation:

Bring together a multi-disciplinary team of smart people. Kelly encouraged the constant exchange of ideas between researchers, engineers, and technicians. These exchanges could be formal or informal; they could take place in the hallway, over lunch, or within assigned multi-disciplinary project teams. And, ideally, these exchanges were in person; the Murray Hill building was even designed with this in mind, featuring long hallways that increased the chance of serendipitous encounters.

Give them real-world, challenging problems to solve. At Bell Labs, there was no shortage of problems related to supporting AT&T's communications business, from inventing ways to save money (by replacing fragile vacuum tubes with transistors, for example) to coming up with better products and services. In addition, during WWII, Bell Labs applied its expertise in technologies such as radar to military applications.

Give them the best technological tools and a stable source of funding. Last but not least, Bell Labs researchers worked with the best equipment available, and innovation relied on a stable stream of funding from the phone companies that the Labs supported.

Further reading. Jon Gertner's New York Times op-ed provides more details from his book. The Wikipedia entry for Bell Labs lists its long history of innovation.


Saturday, October 27, 2012

Are We There Yet? Grace Hopper Conference 2012

I recently attended the 2012 Grace Hopper Celebration of Women in Computing. What an amazing experience! GHC offers a breadth of technical talks, career development, networking, and role models for technical women. This year over 3,500 attendees converged on the Baltimore Convention Center: about half were undergraduate and graduate students in computer science; the rest were tech women from industry and academia.

Are We There Yet? The keynote speaker, Nora Denzel, a tech executive formerly of Intuit, HP, and IBM, delivered an entertaining speech on the conference theme, "Are we there yet?" Her definitive answer was "no": in fact, the fraction of tech jobs held by women has declined to 25%, down from 30% a decade ago. This is a concern because research has shown that more gender-diverse teams make better decisions.

Not Your Normal Tech Conference. Unlike many tech conferences, GHC included talks and panels on a breadth of technical topics. Within a three-day period, I attended interesting discussions about data-intensive computing, agile development, hackathons, and enterprise social networking. The goal was to share cutting-edge academic research and the latest best practices from industry.

Career Development. Another unique aspect was the panels on career development. The research firm Catalyst led a session on "Sponsors or mentors: which will get you there?" The takeaway was that mentoring is essential, but sponsorship is needed for advancement. There were also ample opportunities for informal networking; over lunch I learned about Codechix, a community of women developers in the Bay Area who get together for hacking sessions.

Making a Difference. Can one woman or organization make a difference on diversity? The resounding answer at GHC was "yes!" Take, for example, Harvey Mudd College, a small 750-student engineering college, which this year ranked #1 among schools in GHC attendance, with 58 attending. After intensive efforts to recruit and retain female CS majors, Harvey Mudd's fraction of female CS grads jumped from 10% in 2005 to 40% this year.

Don't Just Take My Word. Check out this Storify overview of GHC: sfy.co/mAvA (click on "Next Page" to load each day's summary).

Friday, October 19, 2012

Giraph for Large Multigraphs

I recently looked into using Giraph for algorithms on large multigraphs - graphs that allow multiple edges between the same pair of nodes. It turns out that Giraph is indeed a great framework for distributed graph processing on Hadoop: it is open source and often gives huge speed-ups compared to MapReduce. However, as it is a relatively new Apache incubator project (since February), certain features are still missing, and some existing ones could be better documented.

Giraph Overview

Giraph is a distributed graph processing framework that runs on Hadoop and is designed to run algorithms on really large graphs. But not too large -- Giraph assumes that your entire graph can fit within the total memory of your cluster, split into partitions spread across the cluster's nodes. In practice, this is a reasonable assumption: for example, Giraph is used at Facebook for its social graph.

Giraph's programming model closely follows Pregel (Google), where you "think like a vertex" when implementing an algorithm. It uses the Bulk Synchronous Parallel (BSP) model for distributed synchronization; an algorithm consists of a series of supersteps separated by synchronization barriers.
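
To illustrate the model, here is a sketch of connected components in that vertex-centric style: every vertex adopts the smallest vertex id it has seen and forwards it to its neighbors, and the algorithm converges once no ids change. As with any sketch here, the exact Giraph class and method names vary by version:

    import java.util.Iterator;
    import org.apache.giraph.graph.Vertex; // location varies by Giraph version
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;

    public class ConnectedComponentsVertex extends
            Vertex<LongWritable, LongWritable, NullWritable, LongWritable> {
        @Override
        public void compute(Iterator<LongWritable> messages) {
            long min = getSuperstep() == 0
                    ? getVertexId().get()      // start from your own id
                    : getVertexValue().get();
            boolean changed = getSuperstep() == 0;
            while (messages.hasNext()) {
                long candidate = messages.next().get();
                if (candidate < min) {
                    min = candidate;
                    changed = true;
                }
            }
            if (changed) {
                setVertexValue(new LongWritable(min));
                // Neighbors see this at the next superstep, after the barrier.
                sendMsgToAllEdges(new LongWritable(min));
            }
            voteToHalt(); // a vertex is woken up again if a message arrives
        }
    }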

Giraph is an alternative to MapReduce, and is currently implemented as one giant Hadoop map phase. By keeping the entire graph in memory, Giraph avoids the expensive disk reads and writes that each iteration would incur if the same algorithm were implemented as a chain of ordinary MapReduce jobs. Instead of a MapReduce shuffle, at each iteration vertices message each other over the network via Netty. Therefore, algorithms with many iterations, such as PageRank, see the largest performance gains.

For more details, please see the Giraph talk (video) from Hadoop Summit 2012.

What's Missing from Giraph

Multigraph Support

Giraph does not fully support multigraphs; GIRAPH-141 is the relevant Jira issue. However, there is a recommended workaround, discussed on a Giraph wiki page: track the multiple edges between two vertices by using an ArrayListWritable subclass as the edge and message types, as sketched below.
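
The idea, roughly, is to collapse all parallel edges between a pair of vertices into a single Giraph edge whose value is a list. A minimal sketch of the pattern, assuming the ArrayListWritable utility class from Giraph's codebase:

    import org.apache.giraph.utils.ArrayListWritable;
    import org.apache.hadoop.io.DoubleWritable;

    // Edge value that holds the weights of all the parallel edges
    // between one pair of vertices in the multigraph.
    public class DoubleArrayListWritable extends ArrayListWritable<DoubleWritable> {
        public DoubleArrayListWritable() {
            super();
        }

        @Override
        public void setClass() {
            // Tell the superclass which Writable type it (de)serializes.
            setClass(DoubleWritable.class);
        }
    }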

List of Edges, RDF as Input

One input format that Giraph does not support is a list of edges. Unlike the adjacency-list format, which puts one vertex and all of its out-edges on a single line, an edge list has one edge per line, so the information for a single vertex is split across multiple lines. RDF is an example of this format. A workaround is discussed in GIRAPH-170, which recommends preprocessing the input with Pig or MapReduce to convert it into a supported format.
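
For the MapReduce route, the conversion is a simple group-by-source-vertex job. A sketch, assuming a whitespace-separated "source destination" edge per input line (the output would still need to match whatever Giraph input format you choose):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class EdgeListToAdjacencyList {
        // Map: re-key each edge by its source vertex.
        public static class EdgeMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] edge = line.toString().trim().split("\\s+");
                ctx.write(new Text(edge[0]), new Text(edge[1]));
            }
        }

        // Reduce: gather all destinations of one source onto a single
        // line -- the one-vertex-per-line layout Giraph can read.
        public static class AdjacencyReducer
                extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text src, Iterable<Text> dsts, Context ctx)
                    throws IOException, InterruptedException {
                StringBuilder neighbors = new StringBuilder();
                for (Text dst : dsts) {
                    if (neighbors.length() > 0) {
                        neighbors.append(' ');
                    }
                    neighbors.append(dst.toString());
                }
                ctx.write(src, new Text(neighbors.toString()));
            }
        }
    }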

Data Locality

To get the best performance on a really large graph, it is desirable to put the worker processes on the same node as the data that they will compute on, a central idea in Hadoop MapReduce. This data locality is not completely implemented in Giraph. GIRAPH-275 discusses this:
This is "poor man's locality" since we have no control over where Hadoop puts our workers. In Hadoop, they can go where the data is given the location info we have to work with. In Giraph, the workers go wherever Hadoop puts them and we are just hoping some data blocks are also local. 
The ultimate locality fix will involve major overhauls to how Giraph gets InputFormat/RecordReader and does task submission but might be too much until we're on YARN (or not worth it until we're on YARN.)
The patch from GIRAPH-275 (to be released in 0.2.0) does alleviate this problem somewhat.