Saturday, July 6, 2013

What's New at Hadoop Summit 2013?

YARN

YARN (Yet Another Resource Negotiator), the new cluster resource manager in Hadoop 2.0, was a major theme at last week's Hadoop Summit. The project itself is not new (it has been in development for several years); what is new is its growing adoption by the Hadoop community. YARN plays a central role in Hadoop 2.0: it lets you run multiple computing frameworks, such as Storm or Spark, alongside MapReduce on the same Hadoop cluster.

There was evidence of community adoption of YARN throughout Hadoop Summit: (1) a keynote by Yahoo! describing their production analytics stack built on YARN (video), (2) talks about the Stinger Initiative to speed up Hive by 100x, which relies on YARN (through the new Tez framework), and (3) the announcement of YARN's release in HDP 2.0, Hortonworks' latest distribution of Hadoop.

YARN at Yahoo!

According to the Yahoo! keynote, YARN has been undergoing serious load testing in production Yahoo! systems for personalization and ad targeting. Their YARN clusters run Storm, Spark, and HBase in addition to MapReduce. This includes a 320-node Storm/YARN cluster that does stream processing, and a total of 400k YARN jobs per day. This Strata blog post contains more details about the talk.

Tez

Tez is a new compute framework that runs on YARN. Tez improves upon MapReduce by executing a complex DAG (directed acyclic graph) of tasks, rather than only the fixed map-then-reduce pattern. Tez is thus better suited to expressing SQL queries, and will be leveraged to speed up Hive jobs.
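To make the DAG idea concrete, here is a minimal Python sketch of a join-then-aggregate query expressed as a graph of tasks. It is only an illustration and not the Tez API: the run_dag helper, the task names, and the sample data are all made up for this example, and it runs in a single process using the standard library's graphlib module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical sample data, invented for this sketch.
ORDERS = [(1, 10.0), (2, 5.0), (1, 2.5)]   # (customer_id, amount)
CUSTOMERS = {1: "east", 2: "west"}          # customer_id -> region


def scan_orders(upstream):
    return ORDERS


def scan_customers(upstream):
    return CUSTOMERS


def join(upstream):
    # Attach each order amount to the region of its customer.
    regions = upstream["scan_customers"]
    return [(regions[cid], amount) for cid, amount in upstream["scan_orders"]]


def aggregate(upstream):
    # Sum the order amounts per region.
    totals = {}
    for region, amount in upstream["join"]:
        totals[region] = totals.get(region, 0.0) + amount
    return totals


TASKS = {
    "scan_orders": scan_orders,
    "scan_customers": scan_customers,
    "join": join,
    "aggregate": aggregate,
}

# Each task maps to the set of tasks whose output it consumes.
DEPS = {
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
}


def run_dag(tasks, deps):
    """Run every task once, after all of its upstream tasks have finished."""
    graph = {name: deps.get(name, set()) for name in tasks}
    results, order = {}, []
    for name in TopologicalSorter(graph).static_order():
        upstream = {dep: results[dep] for dep in graph[name]}
        results[name] = tasks[name](upstream)
        order.append(name)
    return results, order


results, order = run_dag(TASKS, DEPS)
print(results["aggregate"])  # {'east': 12.5, 'west': 5.0}
```

The whole query here is one graph, with the join feeding the aggregation directly. The same query in plain MapReduce would typically need a chain of separate jobs, with intermediate results written out between them.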

Hadoop on Raspberry Pi

Hadoop Summit has traditionally been the place to brag about who has the biggest clusters; however, this LinkedIn demo goes to the opposite extreme :)