Thursday, June 12, 2014

Storm vs. Spark Streaming: Side-by-side comparison

Overview

Both Storm and Spark Streaming are open-source frameworks for distributed stream processing. There are important differences, however, as the following side-by-side comparison shows.

Processing Model, Latency

Although both frameworks provide scalability and fault tolerance, they differ fundamentally in their processing model. Whereas Storm processes incoming events one at a time, Spark Streaming batches up events that arrive within a short time window before processing them. Thus, Storm can process an event with sub-second latency, while Spark Streaming has a latency of several seconds.
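
To make the difference concrete, here is a toy model in plain Python. It does not use either framework's API; the function names and the 2-second batch interval in the usage note are assumptions chosen purely for illustration.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload)


def one_at_a_time(events: List[Event], handle: Callable[[str], None]) -> None:
    """Record-at-a-time model: hand each event to the handler as it arrives."""
    for _arrival, payload in events:
        handle(payload)


def micro_batches(events: List[Event], batch_interval: float) -> List[List[str]]:
    """Micro-batch model: group events by the interval they arrive in."""
    batches: Dict[int, List[str]] = {}
    for arrival, payload in events:
        index = int(arrival // batch_interval)
        batches.setdefault(index, []).append(payload)
    return [batches[i] for i in sorted(batches)]


def batching_delay(arrival: float, batch_interval: float) -> float:
    """Time an event waits for its interval to close before processing can start."""
    batch_end = (int(arrival // batch_interval) + 1) * batch_interval
    return batch_end - arrival
```

With a hypothetical 2-second interval, `micro_batches([(0.1, "a"), (0.5, "b"), (2.2, "c")], 2.0)` groups the events as `[["a", "b"], ["c"]]`, and an event arriving at 0.5 s waits 1.5 s before its batch is even submitted. That wait is the latency floor that the one-at-a-time model avoids.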

Fault Tolerance, Data Guarantees

However, the tradeoff lies in the data guarantees each system offers under failure. Spark Streaming provides better support for fault-tolerant stateful computation. In Storm, each individual record has to be tracked as it moves through the system, so Storm only guarantees that each record will be processed at least once, and it allows duplicates to appear during recovery from a fault. That means mutable state may be incorrectly updated twice.
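
The double-update problem can be sketched in a few lines of plain Python. This is only a simulation under stated assumptions, not Storm code: `fail_before_ack` is a hypothetical set of record positions whose acknowledgement is lost after the state has already been updated, which forces a replay.

```python
from collections import Counter
from typing import List, Set


def process_at_least_once(records: List[str], fail_before_ack: Set[int]) -> Counter:
    """Count words in mutable state, replaying any record whose ack was lost."""
    counts: Counter = Counter()
    for index, word in enumerate(records):
        counts[word] += 1              # the state is updated first ...
        if index in fail_before_ack:   # ... then the ack is lost (simulated fault),
            counts[word] += 1          # so the record is replayed and counted again
    return counts
```

For the input `["a", "b", "a"]` with a lost acknowledgement at position 1, the word "b" ends up with a count of 2 even though it appeared once. No record was lost, but the state is wrong.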

Spark Streaming, on the other hand, need only track processing at the batch level, so it can efficiently guarantee that each mini-batch will be processed exactly once, even if a fault such as a node failure occurs. [Actually, Storm's Trident library also provides exactly-once processing. However, it relies on transactions to update state, which is slower and often has to be implemented by the user.]
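
As a rough illustration of what a transactional state update involves, the sketch below stores a transaction ID next to each count and skips an update whose ID has already been applied. It is plain Python, not Trident's API. The `store` dictionary, the `apply_batch` name, and the assumption that a replayed batch carries the same ID and the same contents are all hypothetical simplifications.

```python
from typing import Dict, Optional, Tuple

# word -> (count, ID of the last transaction applied to this word)
Store = Dict[str, Tuple[int, Optional[int]]]


def apply_batch(store: Store, txid: int, batch_counts: Dict[str, int]) -> Store:
    """Apply one batch's counts, skipping keys this txid has already updated."""
    for word, delta in batch_counts.items():
        # Extra read of the stored transaction ID before every write.
        count, last_txid = store.get(word, (0, None))
        if last_txid == txid:
            continue  # replay of a batch that was already applied to this key
        store[word] = (count + delta, txid)
    return store
```

Replaying `apply_batch(store, 1, {"a": 2})` a second time leaves the count for "a" at 2 rather than 4. The price is the read before each write, which is where the extra latency comes from when the store is not in memory.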

Storm vs. Spark Streaming comparison.

Summary

In short, Storm is a good choice if you need sub-second latency and no data loss. Spark Streaming is better if you need stateful computation, with the guarantee that each event is processed exactly once. Spark Streaming programming logic may also be easier because it is similar to batch programming, in that you are working with batches (albeit very small ones).

Implementation, Programming API

Implementation

Storm is primarily implemented in Clojure, while Spark Streaming is implemented in Scala. This is something to keep in mind if you want to look into the code to see how each system works or to make your own customizations. Storm was developed at BackType and Twitter; Spark Streaming was developed at UC Berkeley.

Programming API

Storm comes with a Java API, as well as support for other languages. Spark Streaming can be programmed in Scala as well as Java.

Batch Framework Integration

One nice feature of Spark Streaming is that it runs on Spark. Thus, you can use the same (or very similar) code that you write for batch processing and/or interactive queries in Spark, on Spark Streaming. This reduces the need to write separate code to process streaming data and historical data.
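
The idea of reusing one piece of logic for both historical and streaming data can be sketched without any framework. The example below is plain Python rather than Spark code; `word_count`, `run_batch`, and `run_streaming` are hypothetical names, and the stream is modeled simply as a list of small batches.

```python
from collections import Counter
from typing import Iterable, List


def word_count(lines: Iterable[str]) -> Counter:
    """Shared logic: count whitespace-separated words in a collection of lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def run_batch(historical_lines: List[str]) -> Counter:
    """Batch job: apply the shared logic once to the whole dataset."""
    return word_count(historical_lines)


def run_streaming(micro_batches: List[List[str]]) -> List[Counter]:
    """Streaming job: apply the same logic to each small batch in turn."""
    return [word_count(batch) for batch in micro_batches]
```

Because the streaming job is just the batch function applied to many small inputs, there is only one implementation of the counting logic to write and test.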

Storm vs. Spark Streaming: implementation and programming API.

Summary

Two advantages of Spark Streaming are that (1) it is not implemented in Clojure :) and (2) it is well integrated with the Spark batch computation framework.

Production, Support

Production Use

Storm has been around for several years and has run in production at Twitter since 2011, as well as at many other companies. Meanwhile, Spark Streaming is a newer project; its only production deployment (that I am aware of) has been at Sharethrough since 2013.

Hadoop Distribution, Support

Storm is the streaming solution in the Hortonworks Hadoop data platform, whereas Spark Streaming is in both MapR's distribution and Cloudera's Enterprise data platform. In addition, Databricks is a company that provides support for the Spark stack, including Spark Streaming.

Cluster Manager Integration

Although both systems can run on their own clusters, Storm also runs on Mesos, while Spark Streaming runs on both YARN and Mesos.

Storm vs. Spark Streaming: production and support.

Summary

Storm has run in production much longer than Spark Streaming. However, Spark Streaming has the advantages that (1) it has a company dedicated to supporting it (Databricks), and (2) it is compatible with YARN.


Further Reading

For an overview of Storm, see these slides.

For a good overview of Spark Streaming, see the slides to a Strata Conference talk. A more detailed description can be found in this research paper.

Update: A couple of readers have mentioned this other comparison of Storm and Spark Streaming from Hortonworks, written in defense of Storm's features and performance.

April 2015: Closing off comments now, since I don't have time to answer questions or keep this doc up to date.

31 comments:

  1. very helpful. clearly explained.

    Thanks..!!!

  2. Great overview, thanks. I would note that there's an implementation for Storm on top of YARN though: https://github.com/yahoo/storm-yarn

    Replies
    1. Yes, thanks for pointing that out. It says it's a "work in progress."

  3. Thanks for this write-up! One factual correction, though: we (MapR) ship the full Apache Spark stack as part of our distribution (http://www.mapr.com/products/apache-spark), and we have many customers using Storm, though it's not yet officially part of our distribution. But then neither is it in the case of Hortonworks (it has lab status, see http://hortonworks.com/labs/storm/). In this sense, I'd very much appreciate it if you could update the table above accordingly.

    Cheers,
    Michael

    Replies
    1. That's great; I have added MapR to the table under "Spark Streaming". I left Hortonworks under Storm, though, because it is in HDP 2.1 (http://hortonworks.com/hdp/whats-new/), which I consider an official release. The link that you cited does say that additional Hadoop integration, such as Storm-on-YARN, is still pending, hence the "lab" status.

  4. https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

    Above is a link to companies that are using Spark in "production". I've been using Storm for over a year and it's very mature. However, I wish it had the backing of an established company, as Apache Spark does.

    Replies
    1. Also, enjoyed reading your post!

    2. Awesome, thanks for the link, Jeryl! Note to reader: these are companies that use *Spark*, but not necessarily Spark Streaming (though some of them do).

  5. This comment has been removed by a blog administrator.

  6. Thanks for the info, great for a newbie to the technologies trying to find their way!

  7. This comment has been removed by a blog administrator.

  8. This comment has been removed by a blog administrator.

  9. This really helped me understand the differences and makes it easier for me to explain how I will be implementing my Master's project :) Thanks Xinh!

  10. An interesting alternative view: http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming

    Replies
    1. Nice, glad to see the perspective from someone familiar with Storm / Trident, and Hortonworks.

  11. May I share this on my blog?

  12. Thanks for the article!
    Could you please explain this point in a bit more detail? "But, it relies on transactions to update state, which is slower and often has to be implemented by the user."
    If I want to write my output to a persistent store, e.g. Redis, then why would it be slower in Storm than in Spark Streaming?

    Replies
    1. Hi Josh, please check out the slide about Storm/Trident here: http://spark-summit.org/wp-content/uploads/2013/10/Spark-Summit-2013-Spark-Streaming.pdf
      If you want exactly-once semantics with Trident, you have to store a per-state transaction ID for each state. I.e., in word-count, for each word, you would store both the count as well as a transaction ID; each key-value pair would look like: (Key:word, Value: count, txid). Before updating the count, you would read in the old transaction ID to make sure it's up to date, and this read causes extra latency. If you are using redis in memory, that might be okay, but if it has to go to disk then that would add noticeable latency to the update. Whereas in Spark, you don't have to store a per-state transaction ID.
      For the details of Trident transactional processing, see http://storm.apache.org/documentation/Trident-state

    2. Hi Xinh, thanks for the explanation. I see. Isn't that similar to Spark checkpointing, where it saves state to HDFS every ~10 seconds? Or is your point that with Storm it would (by default) persist the state much more frequently than Spark?

    3. Hi Josh, yes, the fault tolerance in Spark involves periodic (~10 second) checkpointing of RDDs. And yes, my point is that with Storm Trident the persistence occurs when each batch is processed, which by default happens much more often than once every 10 seconds. In tuning any of these parameters, there's a tradeoff between the frequency of persistence and the recovery time in the case of failure.

  13. Thanks for writing this article. We have just started using Spark and Spark Streaming, and your article gave us the information we needed to decide between Storm and Spark Streaming.

  14. https://www.youtube.com/watch?v=ncb1t4waVZw#t=853
    Issues with Spark Streaming running in production; maybe relevant for anyone exploring Spark for streaming.

    "We're currently using Spark Streaming but I'm this close to ripping it out of the system in favor of anything else"
    "I think that Spark is a fantastic system but Spark streaming is a whole different animal and it's just not there yet in terms of production quality"
    "I also can't ship Scala code when the bytecode changes with every [omitted] release"

  15. Why do you think that Spark Streaming not being implemented in Clojure is an advantage? Both Scala and Clojure are JVM languages with functional leanings.

    Replies
    1. Hi Kiran, you are right: both are functional languages, and that is indeed an advantage in implementing robust distributed systems. I should have pointed that out. What I meant was that, to a Java developer who wants to look into the code, Scala is a lot friendlier in terms of syntax than Clojure, which does not try to look like Java and whose Lisp syntax might look unapproachable.
