Tuesday, January 14, 2014

Scaling Realtime Analytics on Big Data

Scaling Realtime Analytics

How do you scale realtime analytics for big data? Consider the case where big data consists of both incoming streaming data and historical data, and you are providing analytics in realtime over the entire data set. In other words, the data is big and growing quickly, while the desired latency is small. To implement this at scale you can distribute the processing over a cluster of machines. Of course, distributed stream processing comes with some challenges: implementing incremental algorithms, ensuring data consistency, and making the system robust by recovering quickly from node failures and dealing with dropped or duplicate data.

Lambda Architecture

One solution to this problem is the Lambda architecture, as described by Nathan Marz (formerly of BackType, Twitter). This architecture splits up the analytics calculation into a batch layer and a speed (streaming) layer. In the batch layer, as much of the calculation as possible is precomputed with batch processing over historical data. As the dataset grows, the calculation is constantly recomputed. The speed layer applies stream processing on only the most recent data. In this way, most of the processing takes place in batch jobs, in a platform such as Hadoop, while the complexity of stream processing is minimized to affect only the latest data. Errors or approximations that arise during stream processing are corrected when the data is later reprocessed in the batch layer. A serving layer combines the intermediate results from the batch and speed layers to produce the final analytics.

[Diagram of the Lambda architecture, from Nathan Marz's presentation "Runaway Complexity in Big Data"]
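
To make the split concrete, here is a minimal sketch (in plain Scala, with an in-memory stand-in for the real jobs) of the page-view count used as an example later in this post. The PageView type and the layer functions are hypothetical; the point is that the batch layer recomputes from scratch over all historical events, while the speed layer only has to fold in events that arrived since the last batch run.

    // Hypothetical event type; the master dataset of raw events would normally
    // live in a distributed store such as HDFS.
    case class PageView(url: String, timestamp: Long)

    // Batch layer: recompute page-view counts from scratch over all historical
    // events on every run. In practice this would be a Hadoop job, not local Scala.
    def batchLayer(historical: Seq[PageView]): Map[String, Long] =
      historical.groupBy(_.url).map { case (url, views) => url -> views.size.toLong }

    // Speed layer: incrementally count only events newer than the last batch run.
    // In practice this would be a stream processing topology such as Storm.
    def speedLayer(counts: Map[String, Long], event: PageView): Map[String, Long] =
      counts.updated(event.url, counts.getOrElse(event.url, 0L) + 1L)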

The robustness of this approach relies on two things: (1) immutable data and (2) continuous recomputation. That is, a master dataset of all the raw data is kept around indefinitely, and, as this dataset grows, calculations are constantly recomputed over the entire data set. Any failures, bugs, approximations, or missing features in the data processing are easily corrected because you are always recomputing everything from scratch. This architecture aligns well with the fundamental assumptions of big data: horizontally scalable data storage and computation. You can read more about the motivation behind this architecture in Nathan Marz's blog post. It is also described in detail in the book Big Data.

Lambda Architecture in Practice

In practice, you could run the batch layer on Hadoop, the speed (stream processing) layer on Storm, and store the results in a scalable database such as Cassandra. These intermediate results are read and merged by the serving layer.
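
For the page-view example, the serving layer's job is then just to add the precomputed batch count to the speed layer's running count for the same key. Here is a minimal sketch, with hypothetical readBatchCount and readRealtimeCount lookups standing in for reads from two Cassandra tables (one written by the batch job, one by the Storm topology).

    // Hypothetical lookups; in practice these would be reads from two Cassandra
    // tables, one maintained by the Hadoop job and one by the Storm topology.
    def readBatchCount(url: String): Long = ???     // e.g. a row in a batch_pageviews table
    def readRealtimeCount(url: String): Long = ???  // e.g. a row in a realtime_pageviews table

    // Serving layer: the final analytic is the batch result plus the speed-layer delta.
    def pageViews(url: String): Long =
      readBatchCount(url) + readRealtimeCount(url)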

Avoiding Code Duplication

Many applications will use the same processing logic in both the batch and speed layers. For example, to count the number of page views of a URL, you might implement a Hadoop map function in the batch layer that adds one for each page view, and write a Storm bolt in the speed layer that does the same thing. In a real-world application, this means implementing the same business logic twice: once for the batch layer and again for the speed layer. To avoid this duplication, there are a couple of options: (1) adopting the Summingbird framework or (2) running Spark and Spark Streaming.
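
Concretely, the duplicated logic might look something like the sketch below, written against Hadoop's Mapper API and Storm's (pre-1.0, backtype.storm) bolt API. The class names are made up, and keeping the running counts in the bolt's local map is a simplification of what a real topology would do.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.Mapper
    import backtype.storm.topology.base.BaseBasicBolt
    import backtype.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer}
    import backtype.storm.tuple.{Fields, Tuple, Values}

    // Batch layer: a Hadoop mapper that emits 1 for each page view (each input
    // line is assumed to be a URL); a reducer (not shown) sums the ones per URL.
    class PageViewMapper extends Mapper[LongWritable, Text, Text, LongWritable] {
      private val one = new LongWritable(1L)
      override def map(key: LongWritable, line: Text,
                       context: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit =
        context.write(new Text(line.toString.trim), one)   // "add one per page view"
    }

    // Speed layer: a Storm bolt that applies exactly the same business logic,
    // written a second time against a different API.
    class PageViewCountBolt extends BaseBasicBolt {
      private val counts = scala.collection.mutable.Map.empty[String, Long].withDefaultValue(0L)

      override def execute(tuple: Tuple, collector: BasicOutputCollector): Unit = {
        val url = tuple.getString(0)
        counts(url) += 1L                                   // "add one per page view", again
        collector.emit(new Values(url, Long.box(counts(url))))
      }

      override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
        declarer.declare(new Fields("url", "count"))
    }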

Summingbird

Summingbird (a Twitter open source project) is a library that lets you write MapReduce programs in a Scala DSL, which can then be executed on different batch and stream processing platforms, such as Scalding (Hadoop) or Storm. It also provides a framework for the Lambda architecture. All you have to do is write your processing logic as a MapReduce program, then configure Summingbird, in its hybrid batch/realtime mode, to execute that program on Scalding for the batch layer and on Storm for the speed layer, and to merge the results into the final analytic.
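
Modeled on the word-count example in the Summingbird README, a page-view count written once against Summingbird's Producer API might look roughly like this; the pageViewCount name is made up, and the platform parameter P is what later gets bound to Scalding for the batch layer and to Storm for the speed layer.

    import com.twitter.summingbird._

    // The business logic is written once, against Summingbird's abstract Platform.
    // Binding P to the Scalding platform yields the batch job; binding it to the
    // Storm platform yields the streaming topology.
    def pageViewCount[P <: Platform[P]](
        source: Producer[P, String],       // a source of viewed URLs
        store: P#Store[String, Long]) =
      source
        .map(url => url -> 1L)             // add one per page view
        .sumByKey(store)                   // summed per URL via Algebird's Semigroup[Long]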

Spark and Spark Streaming

Alternatively, instead of Hadoop/Storm, you could run the batch layer on Spark and the streaming layer on Spark Streaming (both open source projects from UC Berkeley). Because both platforms share the same Scala API, you would be able to reuse the same code for both the batch and speed layers. Unlike with Summingbird, however, you would need to implement the code to merge the intermediate results yourself.
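
For instance, the counting logic could be written once as a function over an RDD and reused from Spark Streaming via transform. The sketch below is based on the Spark 0.9-era Scala API; the HDFS path, socket source, and batch interval are just placeholders.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._            // pair-RDD operations such as reduceByKey
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // The business logic, written once as a transformation on an RDD of URLs.
    def countViews(views: RDD[String]): RDD[(String, Long)] =
      views.map(url => url -> 1L).reduceByKey(_ + _)

    val conf = new SparkConf().setAppName("pageviews")
    val sc = new SparkContext(conf)

    // Batch layer: apply the logic to the historical data (hypothetical HDFS path).
    val batchCounts = countViews(sc.textFile("hdfs:///data/pageviews"))

    // Speed layer: apply the very same function to each micro-batch of the stream
    // (hypothetical socket source and 10-second batch interval).
    val ssc = new StreamingContext(sc, Seconds(10))
    val streamCounts = ssc.socketTextStream("localhost", 9999).transform(countViews _)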

Additional Info