
Thursday, April 28, 2016

Introduction to Spark for Developers and Data Scientists

What is Spark?

Spark is “a fast and general engine for large-scale data processing”. – http://spark.apache.org/
Spark is also one of the most popular open source frameworks for big data, as measured by number of contributors. Let us find out why.

When do you use Spark?

Suppose you would like to analyze a data set: perform ETL or data munging, then run SQL queries such as grouping and aggregations against the data, and maybe apply a machine learning algorithm. When the data size is small, everything will run quickly on a single machine, and you can use analysis tools like Pandas (Python), R, or Excel, or write your own scripts. But for larger data sets, processing on a single machine becomes too slow, and you will want to move to a cluster of machines. This is when you would use Spark.

You could probably benefit from Spark if:
  • Your data is currently stored in Hadoop / HDFS.
  • Your data set contains more than 100 million rows.
  • Ad-hoc queries take longer than 5 minutes to complete.

What Spark Is Not, Typical Architecture

Spark can be a central component to a big data system, but it is not the only component. It is not a distributed file system: you would typically store your data on HDFS or S3. Nor is Spark a NoSQL database: Cassandra or HBase would be a better place for horizontally scalable table storage. And, it is not a message queue: you would use Kafka or Flume to collect streaming event data. Spark is, however, a compute engine which can take input or send output to all of these other systems.

How do you use Spark?

Implemented in Scala, Spark can be programmed in Scala, Java, Python, SQL, and R. However, not all of the latest functionality is immediately available in all languages.

What kind of operations does Spark support?

Spark SQL. Spark supports batch operations involved in ETL and data munging, via the DataFrame API. It supports parsing different input formats, such as JSON or Parquet. Once the raw data is loaded, you can easily compute new columns from existing columns. You can slice and dice the data by filtering, grouping, aggregating, and joining with other tables. Spark supports relational queries, which you can express in SQL or through the DataFrame API.
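To make the "slice and dice" concrete, here is a plain-Python sketch of what a filter-then-group-then-aggregate computes; the column names and values are made up for the example, and in Spark you would express the same thing with the DataFrame API or SQL rather than an explicit loop:

```python
from collections import defaultdict

# Toy "table": one dict per row, mimicking parsed JSON input.
rows = [
    {"country": "US", "amount": 10},
    {"country": "US", "amount": 5},
    {"country": "DE", "amount": 7},
    {"country": "DE", "amount": -1},  # will be filtered out
]

# Same idea as: SELECT country, SUM(amount) FROM rows
#               WHERE amount > 0 GROUP BY country
totals = defaultdict(int)
for row in rows:
    if row["amount"] > 0:                        # filter
        totals[row["country"]] += row["amount"]  # group + aggregate

print(dict(totals))  # {'US': 15, 'DE': 7}
```

The point of the DataFrame API is that Spark runs this kind of relational operation in parallel across the cluster, instead of in a single-machine loop like the one above.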

Spark Streaming. Spark also provides scalable stream processing. Given an input data stream, for example, coming from Kafka, Spark allows you to perform operations on the streaming data, such as map, reduce, join, and window.
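To illustrate what a windowed operation computes, here is a plain-Python sketch (not the Spark Streaming API) of counting events over a sliding window of the last two micro-batches; the event names and window length are made up for the example:

```python
from collections import Counter

# Events arrive in small time-ordered micro-batches (e.g., one per second).
batches = [
    ["click", "view"],          # t=0
    ["click"],                  # t=1
    ["view", "view", "click"],  # t=2
]

# Sliding window of length 2 batches: recompute counts over the last
# two micro-batches as each new batch arrives.
window_counts = []
for i in range(len(batches)):
    window = batches[max(0, i - 1): i + 1]
    counts = Counter(e for b in window for e in b)
    window_counts.append(dict(counts))

print(window_counts[-1])  # counts over the batches at t=1 and t=2
```

In Spark Streaming the same computation would be a map into (event, 1) pairs followed by a windowed reduce, with the cluster handling parallelism and fault tolerance.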

MLlib. Spark includes a machine learning library, with scalable algorithms for classification, regression, collaborative filtering, clustering, and more. If training your dataset on a single machine takes too long, you might consider cluster computing with Spark.

GraphX. Finally, GraphX is a component in Spark for scalable batch processing on graphs.

As you may have noticed by now, Spark processing is batch oriented. It works best when you want to perform the same operation on all of your data, or a large subset of your data. Even with Spark Streaming, you operate on small batches of the data stream, rather than one event at a time.

Spark vs. Hadoop MapReduce, Hive, Impala

How does Spark compare with other big data compute engines? Unlike Hadoop MapReduce, Spark caches data in memory for huge performance gains when you have ad-hoc queries or iterative workloads, which are common in machine learning algorithms. Hive and Impala both run SQL queries at scale; the advantage of Spark over these systems is (1) the convenience of writing both queries and UDFs in the same language, such as Scala, and (2) support for machine learning algorithms, streaming data, and graph processing within the same system.

Conclusion

This overview has explained what Spark is, when to use it, what kinds of operations it supports, and how it compares with other big data systems. To learn more, please take a look at the Spark website (http://spark.apache.org/).

Thursday, June 12, 2014

Storm vs. Spark Streaming: Side-by-side comparison

Overview

Both Storm and Spark Streaming are open-source frameworks for distributed stream processing. But there are important differences, as you will see in the following side-by-side comparison.

Processing Model, Latency

Although both frameworks provide scalability and fault tolerance, they differ fundamentally in their processing model. Whereas Storm processes incoming events one at a time, Spark Streaming batches up events that arrive within a short time window before processing them. Thus, Storm can achieve sub-second latency of processing an event, while Spark Streaming has a latency of several seconds. 
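A back-of-the-envelope sketch of why micro-batching adds latency (all numbers here are made up for illustration): an event that arrives early in a batch interval must wait for the interval to close before it is processed at all.

```python
# Hypothetical numbers, for illustration only.
batch_interval = 2.0   # seconds per micro-batch
arrival_offset = 0.3   # event arrives 0.3 s into the current interval
compute_time = 0.01    # time to actually process the event

# Per-event model (Storm-style): process as soon as the event arrives.
per_event_latency = compute_time

# Micro-batch model (Spark Streaming-style): wait for the batch to close first.
micro_batch_latency = (batch_interval - arrival_offset) + compute_time

print(round(micro_batch_latency, 2))  # 1.71
```

So the batch interval puts a floor under the achievable latency, independent of how fast the actual computation is.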

Fault Tolerance, Data Guarantees

However, the tradeoff is in the fault tolerance and data guarantees. Spark Streaming provides better support for fault-tolerant stateful computation. In Storm, each individual record has to be tracked as it moves through the system, so Storm guarantees only that each record will be processed at least once, and allows duplicates to appear during recovery from a fault. That means mutable state may be incorrectly updated twice.
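A small sketch of the problem, in plain Python: under at-least-once delivery, a replayed record silently corrupts a naive counter, and one common mitigation is to make the update idempotent by tracking record IDs (the record values here are made up for the example):

```python
# Under at-least-once delivery, a record may be replayed after a fault.
records = ["a", "b", "b"]  # "b" was delivered twice during recovery

# Non-idempotent update: a plain counter double-counts the replayed record.
count = 0
for _ in records:
    count += 1
print(count)  # 3, even though only 2 distinct records were sent

# Idempotent update: remember record IDs so replays have no effect.
seen = set()
for r in records:
    seen.add(r)
print(len(seen))  # 2, the correct answer despite the duplicate
```

Tracking IDs (or using transactional updates, as Storm's Trident does) restores correctness, but at extra cost in state and bookkeeping.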

Spark Streaming, on the other hand, need only track processing at the batch level, so it can efficiently guarantee that each mini-batch will be processed exactly once, even if a fault such as a node failure occurs. [Actually, Storm's Trident library also provides exactly once processing. But, it relies on transactions to update state, which is slower and often has to be implemented by the user.]

Storm vs. Spark Streaming comparison.

Summary

In short, Storm is a good choice if you need sub-second latency and no data loss. Spark Streaming is better if you need stateful computation, with the guarantee that each event is processed exactly once. Spark Streaming programming logic may also be easier because it is similar to batch programming, in that you are working with batches (albeit very small ones).

Implementation, Programming API

Implementation

Storm is primarily implemented in Clojure, while Spark Streaming is implemented in Scala. This is something to keep in mind if you want to look into the code to see how each system works or to make your own customizations. Storm was developed at BackType and Twitter; Spark Streaming was developed at UC Berkeley.

Programming API

Storm comes with a Java API, as well as support for other languages. Spark Streaming can be programmed in Scala as well as Java.

Batch Framework Integration

One nice feature of Spark Streaming is that it runs on Spark. Thus, you can use the same (or very similar) code that you write for batch processing and/or interactive queries in Spark, on Spark Streaming. This reduces the need to write separate code to process streaming data and historical data.
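The reuse idea can be sketched in plain Python: write one transformation and apply it both to a historical batch and to each micro-batch of a stream. In Spark you would get the same effect by sharing functions between RDD/DataFrame jobs and DStream operations; the log lines below are made up for the example.

```python
# One shared transformation, written once...
def count_errors(lines):
    return sum(1 for line in lines if "ERROR" in line)

# ...applied to a historical batch:
historical = ["INFO ok", "ERROR disk", "ERROR net"]
print(count_errors(historical))  # 2

# ...and to each micro-batch of a stream:
stream_batches = [["INFO ok"], ["ERROR cpu", "INFO ok"]]
print([count_errors(b) for b in stream_batches])  # [0, 1]
```

Because Spark Streaming works on batches internally, the same function really can serve both paths, which is exactly what a Storm topology plus a separate Hadoop job cannot give you.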

Storm vs. Spark Streaming: implementation and programming API.

Summary

Two advantages of Spark Streaming are that (1) it is not implemented in Clojure :) and (2) it is well integrated with the Spark batch computation framework.

Production, Support

Production Use

Storm has been around for several years and has run in production at Twitter since 2011, as well as at many other companies. Meanwhile, Spark Streaming is a newer project; its only production deployment (that I am aware of) has been at Sharethrough since 2013.

Hadoop Distribution, Support

Storm is the streaming solution in the Hortonworks Hadoop data platform, whereas Spark Streaming is in both MapR's distribution and Cloudera's Enterprise data platform. In addition, Databricks is a company that provides support for the Spark stack, including Spark Streaming.

Cluster Manager Integration

Although both systems can run on their own clusters, Storm also runs on Mesos, while Spark Streaming runs on both YARN and Mesos.

Storm vs. Spark Streaming: production and support.

Summary

Storm has run in production much longer than Spark Streaming. However, Spark Streaming has the advantages that (1) it has a company dedicated to supporting it (Databricks), and (2) it is compatible with YARN.


Further Reading

For an overview of Storm, see these slides.

For a good overview of Spark Streaming, see the slides to a Strata Conference talk. A more detailed description can be found in this research paper.

Update: A couple of readers have mentioned this other comparison of Storm and Spark Streaming from Hortonworks, written in defense of Storm's features and performance.

April, 2015: Closing off comments now, since I don't have time to answer questions or keep this doc up-to-date.

Tuesday, January 14, 2014

Scaling Realtime Analytics on Big Data

Scaling Realtime Analytics

How do you scale realtime analytics for big data? Consider the case where big data consists of both incoming streaming data, as well as historical data, and you are providing analytics in realtime over the entire data set. In other words, the data is big, and growing quickly, while the desired latency is small. To implement this at scale you can distribute the processing over a cluster of machines. Of course, distributed stream processing comes with some challenges: implementing incremental algorithms, ensuring data consistency, and making the system robust by recovering quickly from node failures and dealing with dropped or duplicate data.

Lambda Architecture

One solution to this problem is the Lambda architecture, as described by Nathan Marz (formerly of BackType, Twitter). This architecture splits up the analytics calculation into a batch layer and a speed (streaming) layer. In the batch layer, as much of the calculation as possible is precomputed with batch processing over historical data. As the dataset grows, the calculation is constantly recomputed. The speed layer applies stream processing on only the most recent data. In this way, most of the processing takes place in batch jobs, in a platform such as Hadoop, while the complexity of stream processing is minimized to affect only the latest data. Errors or approximations that arise during stream processing are corrected when the data is later reprocessed in the batch layer. A serving layer combines the intermediate results from the batch and speed layers to produce the final analytics.

From Nathan Marz's presentation "Runaway Complexity in Big Data"

The robustness of this approach relies on two things: (1) immutable data and (2) continuous recomputation. That is, a master dataset of all the raw data is kept around indefinitely, and, as this dataset grows, calculations are constantly recomputed over the entire data set. Any failures, bugs, approximations, or missing features in the data processing are easily corrected because you are always recomputing everything from scratch. This architecture aligns well with the fundamental assumptions of big data: horizontally scalable data storage and computation. You can read more about the motivation behind this architecture in Nathan Marz's blog post. It is also described in detail in the book Big Data.

Lambda Architecture in Practice

In practice, you could run the batch layer on Hadoop, the speed (stream processing) layer on Storm, and store the results in a scalable database such as Cassandra. These intermediate results are read and merged by the serving layer.
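The serving layer's merge step is conceptually simple; here is a plain-Python sketch, assuming the analytic is a per-URL page-view count (the URLs and counts are made up for the example):

```python
# Precomputed by the batch layer over all historical data:
batch_views = {"/home": 1000, "/about": 40}

# Computed by the speed layer over recent, not-yet-batch-processed events:
speed_views = {"/home": 12, "/new": 3}

# Serving layer: combine the two partial results into the final analytic.
total_views = dict(batch_views)
for url, n in speed_views.items():
    total_views[url] = total_views.get(url, 0) + n

print(total_views)  # {'/home': 1012, '/about': 40, '/new': 3}
```

Each time the batch layer finishes a recomputation, its results replace the portion previously covered by the speed layer, so any approximations in the streaming path are eventually corrected.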

Avoiding Code Duplication

Many applications will use the same processing logic in both the batch and speed layers. For example, to count the number of page views of a URL, in the batch layer you might implement a Hadoop Map function that adds one for each new page view, and in the speed layer write a Storm bolt that does the same thing. In a real world application, this could lead to implementing the same business logic twice: once for the batch layer and again for the speed layer. To avoid duplicate code, there are a couple of options: (1) adopting the Summingbird framework or (2) running Spark and Spark Streaming.

Summingbird

Summingbird (a Twitter open source project) is a library that lets you write MapReduce programs in a Scala DSL, which can then be executed on different batch and stream processing platforms, such as Scalding (Hadoop) or Storm. It also provides a framework for the Lambda architecture. All you have to do is write your processing logic as a MapReduce program. Then you configure Summingbird, in its hybrid batch/realtime mode, to execute this program on Scalding, for the batch layer, and Storm, for the speed layer, and then to merge the results into a final analytic.

Spark and Spark Streaming

Alternatively, instead of Hadoop/Storm, you could run the batch layer on Spark and the streaming layer on Spark Streaming (both open source projects from UC Berkeley). Because both platforms share the same Scala API, you would be able to reuse the same code for both the batch and speed layers. Unlike with Summingbird, however, you would need to implement the code to merge the intermediate results yourself.

Additional Info