Saturday, October 27, 2012

Are We There Yet? Grace Hopper Conference 2012

I recently attended the 2012 Grace Hopper Celebration of Women in Computing. What an amazing experience! GHC offers a breadth of technical talks, career development, networking, and role models for technical women. This year over 3500 attendees converged onto the Baltimore Convention Center: about half were undergraduate and graduate students in computer science; the rest were tech women from industry and academia.

Are We There Yet? The keynote speaker, Nora Denzel, a tech executive formerly of Intuit, HP, and IBM, delivered an entertaining speech on the conference theme, "Are we there yet?" Her definitive answer was "no": in fact, the fraction of tech jobs held by women has declined to 25%, down from 30% a decade ago. This is a concern because research has shown that more gender diverse teams make better decisions.

Not Your Normal Tech Conference. Unlike many tech conferences, GHC included talks and panels on a breadth of technical topics. Within a three-day period, I attended interesting discussions about data intensive computingagile developmenthackathons, and enterprise social networking. The goal was to share cutting-edge academic research and the latest best practices from industry.

Career Development. Another unique aspect were the panels on career development. Catalyst research firm led a session on "Sponsors or mentors: which will get you there?" The takeaway was that mentoring is essential, but sponsorship is needed for advancement. There were ample opportunities for informal networking; over lunch I learned about Codechix, a community of women developers in the Bay Area who get together for hacking sessions.

Making a Difference. Can one woman or organization make a difference on diversity? The resounding answer at GHC was "yes!" Take for example Harvey Mudd College, a small 750-student engineering college, which this year ranked #1 among schools in GHC attendance, with 58 attending. After intensive efforts in recruiting and retaining female CS majors, Harvey Mudd's fraction of female CS grads jumped from 10% in 2005, to 40% this year.

Don't Just Take my Word. Check out this Storify overview of GHC: (click on "Next Page" to load each day's summary.)

Friday, October 19, 2012

Giraph for Large Multigraphs

I recently looked into using Giraph for algorithms on large multigraphs, graphs that allow multiple edges between two nodes. It turns out that Giraph is indeed a great framework for distributed graph processing on Hadoop, that often gives huge speed-ups compared to MapReduce and is open source. However, as it is a relatively new Apache incubator project (since February), there are certain features that are still missing, and some existing ones that could be better documented.

Giraph Overview

Giraph is a distributed graph processing framework that runs on Hadoop and is designed to run algorithms on really large graphs. But not too large -- Giraph assumes that your entire graph can fit within the total memory of your cluster, by splitting the graph into partitions across the cluster. In practice, this is a reasonable assumption: for example, Giraph is used at Facebook for its social graph.

Giraph's programming model closely follows Pregel (Google), where you "think like a vertex" when implementing an algorithm. It uses the Bulk Synchronous Parallel (BSP) model for distributed synchronization; an algorithm consists of a series of supersteps separated by synchronization barriers.

Giraph is an alternative to MapReduce, currently implemented as one giant Hadoop Map phase. By keeping the entire graph in memory, Giraph avoids the expensive disk reads and writes associated with each MapReduce iteration that occurs if the same algorithm were implemented in normal MapReduce. Instead of a MapReduce shuffle, at each iteration vertices message each other over the network via Netty. Therefore, algorithms with many iterations, such as PageRank, will achieve the most performance gain.

For more details, please see the Giraph talk (video) from Hadoop Summit 2012.

What's Missing from Giraph

Multigraph Support

Giraph does not fully support multigraphs; GIRAPH-141 is the relevant Jira issue. However, there is a recommended workaround that is discussed on a Giraph wiki page. It involves tracking multiple edges between two vertices by using an ArrayListWritable class as the Edge and Message types.

List of Edges, RDF as Input

One input format that Giraph does not support is a list of edges. Unlike the adjacency list format which has one vertex and all its out edges per line, in this case there is one edge per line. That is, the information for one vertex is split across multiple lines. RDF is an example of this format. A workaround is discussed in GIRAPH-170, which recommends preprocessing the input with Pig or MapReduce, to convert the input into a format that is supported.

Data Locality

To get the best performance on a really large graph, it is desirable to put the worker processes on the same node as the data that they will compute on, a central idea in Hadoop MapReduce. This data locality is not completely implemented in Giraph. GIRAPH-275 discusses this:
This is "poor man's locality" since we have no control over where Hadoop puts our workers. In Hadoop, they can go where the data is given the location info we have to work with. In Giraph, the workers go wherever Hadoop puts them and we are just hoping some data blocks are also local. 
The ultimate locality fix will involve major overhauls to how Giraph gets InputFormat/RecordReader and does task submission but might be too much until we're on YARN (or not worth it until we're on YARN.)
The patch from GIRAPH-275 (to be released in 0.2.0) does alleviate this problem somewhat.