Thursday, December 4, 2014

Lightbeam: Shines a Light on Creepy Ads

Have you ever had a creepy ad experience? Have you gone online to search for lodging in Lake Tahoe, and then two minutes later received an email from Groupon for a "Tahoe weekend getaway"? Sure, many companies advertise online, but Groupon ads seem extremely targeted. So, I was curious: how much is Groupon tracking me?

Lightbeam Shows Third-Party Tracking

One way that websites target ads is by tracking a user's online behavior. When a user visits a website, the web page may execute tracking scripts and store cookies, which allow third-party ad networks to track that user. These ad networks can then collect information whenever the user visits another website that partners with them. (More info here.)

To investigate the extent of Groupon's tracking, I installed the Lightbeam add-on in my Firefox browser. This add-on, released by Mozilla about a year ago, shows all the third-party cookies and connections made by websites in an interactive visualization. After browsing to a few websites (Yahoo, Google, Groupon), I could see that Groupon contacted by far the largest number of third-party domains for ad tracking: 17 in all, 8 of which also created a cookie.

Groupon (the green "G") made the most third-party connections, as shown in Lightbeam.

Third-party Domains

What are these third-party domains that Groupon connected with?

  • Google Tag Manager: "Tags are tiny bits of website code that let you measure traffic and visitor behavior."
  • Criteo: "Personalized retargeting company."
  • Facebook
  • "The most mature ad optimization and bid management system in the business."
  • Rubicon Project: "Leading the automation of advertising."
  • Doubleclick: Internet ad serving, subsidiary of Google.
  • PubMatic: "Programmatic advertising."
  • Yahoo
  • Channel Intelligence (Google): "Optimize your data and sell more products."
  • Nanigans: "The industry-leading advertising automation platform."
  • Atwola: "Atwola is a cookie that is used to track user browsing history." 

Curious about ad tracking in websites you visit? Check out the Lightbeam Firefox add-on.

Sunday, September 28, 2014

Higgs Data Challenge: Particle Physics meets Machine Learning

The HiggsML Challenge on Kaggle

How do you frame a particle physics problem as a data challenge? Will it be accessible to Kagglers? Will data scientists be able to outperform physicists with machine learning, as was the case in the NASA dark matter challenge? The Kaggle Higgs competition recently came to a close with great results.

The HiggsML challenge: when High Energy Physics meets Machine Learning.

Competition summary.

A Particle Physics Data Challenge

Goal: Classify collision events

The goal of the competition was to classify 550,000 proton-proton collision events as either originating from a Higgs boson decay (a "signal" event) or not (a "background" event). This type of analysis can be used in experiments at the Large Hadron Collider to discover decay channels of the Higgs boson.

Data: Tracks of particles produced in each collision

Each proton-proton collision, or event, produces a number of outgoing particles, whose tracks are recorded by a particle detector. The training data consisted of 250,000 labelled events, obtained from simulations of the collisions and detector. For each event, 30 features based on particle track data, such as particle mass, speed, and direction, were provided. The actual experimental data is highly unbalanced (much more background than signal), but the training data provided was fairly balanced.

Particle tracks: a Higgs boson decays into two τ particles, each of which subsequently decays into either an electron (blue line) or a muon (red line). ATLAS Experiment © 2014 CERN.

Evaluation Metric: Significance of discovery

Instead of a standard performance metric, submissions were evaluated by Approximate Median Significance (AMS), a measure of how well a classifier is able to discover new phenomena.

When a classifier labels events (in experimental data, not simulations) as "signal", the p-value is the probability that a background-only process would have produced a result at least that signal-like. The corresponding "significance" (Z) is the standard normal quantile of 1 − p, measured in units of sigma. We want a small p-value, and thus a large significance, before claiming a discovery. Finally, AMS is an estimate of the significance a given classifier would achieve. For more details, see the Kaggle evaluation page.
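
Concretely, the AMS formula from the challenge documentation can be computed from the (weighted) numbers of selected signal and background events. A minimal sketch in Python, using the b_reg = 10 regularization constant specified for this competition:

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate Median Significance: s and b are the weighted numbers
    of signal and background events selected by the classifier; b_reg is
    a regularization constant (10 in this competition)."""
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

# Selecting more signal at the same background raises the significance:
low = ams(100.0, 200.0)    # roughly 6.4 sigma
high = ams(200.0, 200.0)   # roughly 12.2 sigma
```

Intuitively, for s much smaller than b, AMS behaves like s / sqrt(b): the signal excess measured against the statistical fluctuation of the background.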

Making Particle Physics Accessible

The Organizers

The contest organizers did a fantastic job of designing and organizing this competition. A group of physicists and machine learning researchers, they took a difficult subject, particle physics, and spent 18 months creating a data challenge that ultimately attracted 1,785 teams (a Kaggle record).

Their success can be attributed to the following:

  • Designing a dataset that could be analyzed without any domain knowledge. 
  • Providing multiple software starter kits that made it easy to get an analysis up and running. 
  • Providing a 16-page document explaining the challenge.
  • Answering questions promptly on the Kaggle discussion forum.

Not needed to win the contest. "Higgs-gluon-fusion" by TimothyRias. Licensed under Creative Commons Attribution-Share Alike 3.0.

The Results

Did machine learning give data scientists an edge? How did the physicists fare?

Cross validation was key

It was not machine learning but the statistical technique of cross validation that was key to winning this competition. The AMS metric had high variance, which made it difficult to know how well a model was performing. You could not rely on the public leaderboard score because it was calculated on only a small subset of the data. Therefore, a rigorous cross validation (CV) method was needed for model comparison and parameter tuning. The 1st-place finisher, Gabor Melis, ran 2-fold stratified CV, repeated 35 times on random shuffles of the training data. Both the 1st-place and 4th-place finishers mentioned on the forum that cross validation was key to a high score.

Insufficient CV could lead to overfitting and large shake-ups in the final ranking, as can be seen in a meta-analysis of the competition scores.
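
The repeated-CV idea can be sketched in a few lines of plain Python (a toy version without the stratification, with a stand-in scoring function; the names here are illustrative, not Melis's actual pipeline):

```python
import random

def repeated_kfold_scores(data, score_fn, k=2, repeats=35, seed=0):
    """Repeat k-fold cross validation on random shuffles of the data,
    collecting one score per fold.  Averaging many noisy fold scores
    gives a much more reliable estimate of model quality than a single
    score on a small held-out set (like the public leaderboard)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        fold_size = len(shuffled) // k
        for i in range(k):
            held_out = shuffled[i * fold_size:(i + 1) * fold_size]
            scores.append(score_fn(held_out))
    return scores

# Toy example: the "score" is just the mean of the held-out fold.
data = list(range(100))
scores = repeated_kfold_scores(data, lambda fold: sum(fold) / len(fold))
mean = sum(scores) / len(scores)   # stable estimate despite noisy folds
```

In the real competition, `score_fn` would train a model on the other folds and evaluate AMS on the held-out fold.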

Overfitting led to shake-ups in the final ranking. "Overfitting svg" by Gringer. Licensed under Creative Commons Attribution 3.0

Feature engineering

Feature engineering was less important in this contest, having only a small impact on the score. The reason was that the organizers had already designed good features, such as those used in the actual Higgs discovery, into the provided dataset. Still, participants did come up with useful features, such as one named "Cake", which discriminated between the signal and one major source of background events, the Z boson.

Software and Algorithms

The software libraries and algorithms discussed on the forum were mostly standard ones: gradient tree boosting, bagging, Python scikit-learn, and R. The 1st-place entry used a bag of 70 dropout neural networks (overview, model and code). Early in the competition, Tianqi Chen shared a fast gradient boosting library called XGBoost, which was popular among participants.
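
The core idea behind gradient boosting can be sketched in plain Python (a toy regression version with decision stumps and squared loss, not the optimized tree implementations like XGBoost that participants actually used):

```python
def fit_stump(x, residuals):
    """Fit a one-split decision stump minimizing squared error,
    predicting the mean residual on each side of the threshold."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, rounds=10, lr=0.5):
    """Gradient boosting for squared loss: each round fits a stump to
    the current residuals and adds it (scaled) to the ensemble."""
    pred = [0.0] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(lr * s(xi) for s in stumps)

x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
model = gradient_boost(x, y)
```

Real libraries fit full trees instead of stumps, use regularization and shrinkage schedules, and (in XGBoost's case) a second-order approximation of the loss, but the additive fit-to-residuals loop is the same.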

The 1st-place entry used a bag of 70 dropout neural networks. "Colored neural network". Licensed under Creative Commons Attribution-Share Alike 3.0.

Particle Physics meets Machine Learning

The organizers' overall goal was to increase the cross fertilization of ideas between the particle physics and machine learning communities. In that respect, this challenge was a good first step.

Physicists who participated began to appreciate machine learning techniques. This quote from the forum summarizes that sentiment: "The process of attempting as a physicist to compete against ML experts has given us a new respect for a field that (through ignorance) none of us held in as high esteem as we do now."

On the machine learning side, one researcher, Lester Mackey, developed a method to optimize AMS directly. In scikit-learn, support for sample weights was added to the GradientBoostingClassifier, in part due to interest from this competition.

And, much more will be discussed at the upcoming NIPS workshop on particle physics and machine learning. There will also be a workshop on data challenges in machine learning.

There will be a workshop on particle physics and machine learning at the NIPS machine learning conference.

Personal Note

I studied physics at Caltech and Stanford / SLAC (Stanford Linear Accelerator Center) before switching to computer science, so I could not pass up the opportunity to participate in this challenge myself. This was my first Kaggle competition, and I ranked in the top 10%.

Further Reading

  • The challenge website, with supplementary information not found on Kaggle: The HiggsML Challenge
  • Tim Salimans, placed 2nd, with a blend of boosted decision tree ensembles: details
  • Courtiol Pierre, placed 3rd, with an ensemble of neural networks: details
  • Lubos Motl, a theoretical physicist who ranked 8th, wrote several blog posts about the competition: initial post
  • "Log0" ranked 23rd by using scikit-learn AdaBoostClassifier and ExtraTreesClassifier: forum post
  • "phunter" ranked 26th with a single XGBoost model: What can a single model do?
  • Darin Baumgartel shared code based on scikit-learn GradientBoostingClassifier: post
  • Trevor Stephens, a data scientist, tried out hundreds of automatically generated features: Armchair Particle Physicist
  • Bios of participants who reached the top of the public leaderboard: Portraits
  • Balazs Kegl, one of the organizers, talked about the challenge in this video: CERN seminar

Thursday, June 12, 2014

Storm vs. Spark Streaming: Side-by-side comparison


Both Storm and Spark Streaming are open-source frameworks for distributed stream processing. But, there are important differences as you will see in the following side-by-side comparison.

Processing Model, Latency

Although both frameworks provide scalability and fault tolerance, they differ fundamentally in their processing model. Whereas Storm processes incoming events one at a time, Spark Streaming batches up events that arrive within a short time window before processing them. Thus, Storm can achieve sub-second latency in processing an event, while Spark Streaming has a latency of several seconds.
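
The two processing models can be caricatured in a few lines of Python (batching here is by count for simplicity; Spark Streaming actually batches by time window):

```python
def process_one_at_a_time(events, handle):
    """Storm-style: each event is handled the moment it arrives,
    so per-event latency is minimal."""
    for e in events:
        handle(e)

def process_micro_batches(events, handle_batch, batch_size=3):
    """Spark Streaming-style: events are buffered into small batches
    before processing, so each event waits for its batch to fill
    (or, in the real system, for the time window to close)."""
    batch = []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            handle_batch(batch)
            batch = []
    if batch:
        handle_batch(batch)

seen = []
process_micro_batches([1, 2, 3, 4, 5], seen.append)
# seen is now [[1, 2, 3], [4, 5]]: two batches rather than five events
```

The batching adds latency but, as discussed next, makes fault-tolerant bookkeeping much cheaper, since state only needs to be tracked per batch.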

Fault Tolerance, Data Guarantees

However, the tradeoff is in fault tolerance and data guarantees. Spark Streaming provides better support for fault-tolerant stateful computation. In Storm, each individual record has to be tracked as it moves through the system, so Storm guarantees only that each record will be processed at least once; duplicates may appear during recovery from a fault. That means mutable state may be incorrectly updated twice.

Spark Streaming, on the other hand, need only track processing at the batch level, so it can efficiently guarantee that each mini-batch will be processed exactly once, even if a fault such as a node failure occurs. [Actually, Storm's Trident library also provides exactly-once processing. But it relies on transactions to update state, which is slower and often has to be implemented by the user.]
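
The difference can be illustrated with a toy counter in Python (this is not how Storm or Trident is implemented, just the failure mode and the idempotent fix):

```python
def count_at_least_once(records, counts):
    """Naive mutable-state update: if a record is replayed after a
    fault, it is counted twice."""
    for rec in records:
        counts[rec["url"]] = counts.get(rec["url"], 0) + 1

def count_idempotent(records, counts, seen_ids):
    """Transactional-style idempotent update: remembering which record
    ids were already applied makes replays harmless, giving an
    exactly-once effect at the cost of extra bookkeeping."""
    for rec in records:
        if rec["id"] in seen_ids:
            continue
        seen_ids.add(rec["id"])
        counts[rec["url"]] = counts.get(rec["url"], 0) + 1

stream = [{"id": 1, "url": "/a"}, {"id": 2, "url": "/a"}]
replayed = stream + [stream[1]]        # record 2 is delivered again

naive = {}
count_at_least_once(replayed, naive)   # {"/a": 3} -- double-counted

safe, seen = {}, set()
count_idempotent(replayed, safe, seen) # {"/a": 2} -- correct
```

Tracking a batch id instead of a per-record id is what lets Spark Streaming provide the same guarantee with far less state.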

Storm vs. Spark Streaming comparison.


In short, Storm is a good choice if you need sub-second latency and no data loss. Spark Streaming is better if you need stateful computation with the guarantee that each event is processed exactly once. Spark Streaming programming logic may also be easier, because it is similar to batch programming: you are working with batches (albeit very small ones).

Implementation, Programming API


Storm is primarily implemented in Clojure, while Spark Streaming is implemented in Scala. This is something to keep in mind if you want to look into the code to see how each system works or to make your own customizations. Storm was developed at BackType and Twitter; Spark Streaming was developed at UC Berkeley.

Programming API

Storm comes with a Java API, as well as support for other languages. Spark Streaming can be programmed in Scala as well as Java.

Batch Framework Integration

One nice feature of Spark Streaming is that it runs on Spark. Thus, you can use the same (or very similar) code that you write for batch processing and/or interactive queries in Spark, on Spark Streaming. This reduces the need to write separate code to process streaming data and historical data.
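
A toy sketch of the idea in plain Python (Spark's actual API is different; this just shows one function serving both roles):

```python
def count_words(lines):
    """One piece of business logic, written once."""
    counts = {}
    for line in lines:
        for w in line.split():
            counts[w] = counts.get(w, 0) + 1
    return counts

# Batch: run over the full historical dataset.
historical = ["spark storm", "spark"]
batch_result = count_words(historical)

# Streaming: run the same function over each incoming micro-batch.
stream_results = [count_words(b) for b in [["storm"], ["spark storm"]]]
```

In Spark, the shared piece would be the transformation logic applied to an RDD in batch mode and to each mini-batch of a stream in Spark Streaming.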

Storm vs. Spark Streaming: implementation and programming API.


Two advantages of Spark Streaming are that (1) it is not implemented in Clojure :) and (2) it is well integrated with the Spark batch computation framework.

Production, Support

Production Use

Storm has been around for several years and has run in production at Twitter since 2011, as well as at many other companies. Meanwhile, Spark Streaming is a newer project; its only production deployment (that I am aware of) has been at Sharethrough since 2013.

Hadoop Distribution, Support

Storm is the streaming solution in the Hortonworks Hadoop data platform, whereas Spark Streaming is in both MapR's distribution and Cloudera's Enterprise data platform. In addition, Databricks is a company that provides support for the Spark stack, including Spark Streaming.

Cluster Manager Integration

Although both systems can run on their own clusters, Storm also runs on Mesos, while Spark Streaming runs on both YARN and Mesos.

Storm vs. Spark Streaming: production and support.


Storm has run in production much longer than Spark Streaming. However, Spark Streaming has the advantages that (1) it has a company dedicated to supporting it (Databricks), and (2) it is compatible with YARN.

Further Reading

For an overview of Storm, see these slides.

For a good overview of Spark Streaming, see the slides to a Strata Conference talk. A more detailed description can be found in this research paper.

Update: A couple of readers have mentioned this other comparison of Storm and Spark Streaming from Hortonworks, written in defense of Storm's features and performance.

April, 2015: Closing off comments now, since I don't have time to answer questions or keep this doc up-to-date.

Friday, April 25, 2014

Bay Area Bike Share Data

Bay Area Bike Share recently made some of their bike trip data available for their open data challenge. Curious about how this new transportation option is used, I took a look at the data myself. The code behind my analysis is available on github.

Trip Length

This bike share system is designed for short trips between bike stations. Of the 144,000 trips in the data set, the median trip length was only nine minutes, and about 96% of trips were less than one hour long. This makes sense, since a late fee is charged if a bike is not returned to a station within half an hour. Of course, one could go on longer rides without incurring a late fee by hopping between stations, as long as each hop is shorter than half an hour; however, the data set does not provide enough information to infer whether a trip was part of a multi-leg ride.

The median trip length was nine minutes.

Late Fees

About 4% of the trips (5,696) were longer than one hour and thus incurred a late fee. In fact, the average length of these long trips was about 4.5 hours. Since a late charge of $7 applies to every half hour after the first hour, that averages out to $49 in late fees per trip, for a total of about $280k in late fees collected overall. Now that's a great revenue stream! :)
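
The arithmetic checks out (assuming each half hour beyond the first hour is billed in full; the function below is illustrative, not the official fee schedule):

```python
import math

def late_fee(trip_hours, rate=7.0, grace_hours=1.0, block_hours=0.5):
    """$7 for each half hour (or part thereof) beyond the first hour."""
    if trip_hours <= grace_hours:
        return 0.0
    blocks = math.ceil((trip_hours - grace_hours) / block_hours)
    return blocks * rate

fee = late_fee(4.5)      # 7 half-hour blocks * $7 = $49
total = fee * 5696       # roughly $280k across all late trips
```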

Why are these bikes returned late? Did the riders not know about the late fee? Did they get lost? I do not know, but the vast majority of these trips, 94% (5,379), were taken by short-term customers (those with 1- or 3-day memberships), as opposed to subscribers (who hold annual memberships), even though short-term customers account for only 21% of all trips.

Popular Stations

The top three stations at which trips started were all in downtown San Francisco:
  1. San Francisco Caltrain (Townsend at 4th)     :  9838
  2. Harry Bridges Plaza (Ferry Building)         :  7343
  3. Embarcadero at Sansome                       :  6545
However, if broken down by membership type, the most popular stations were:
  • Among short-term customers (1- or 3-day memberships): Embarcadero at Sansome
  • Among subscribers (annual membership): San Francisco Caltrain (Townsend at 4th)

Day of Week

The weekly patterns also vary by membership type. Short-term customers tend to take more trips over the weekend, while subscribers tend to ride on week days:

Customers (with 1- or 3-day memberships) preferred weekends.

Subscribers (with annual memberships) preferred weekdays.

Additional Data

The above analysis was only on trip data. The data set also contained weather data, bike station information, and station rebalancing data. If you are interested in taking a look, check out the data from Bay Area Bike Share and the R code used in my analysis. 

Tuesday, January 14, 2014

Scaling Realtime Analytics on Big Data

Scaling Realtime Analytics

How do you scale realtime analytics for big data? Consider the case where the data consists of both incoming streaming data and historical data, and you are providing analytics in realtime over the entire data set. In other words, the data is big, and growing quickly, while the desired latency is small. To implement this at scale, you can distribute the processing over a cluster of machines. Of course, distributed stream processing comes with some challenges: implementing incremental algorithms, ensuring data consistency, and making the system robust by recovering quickly from node failures and dealing with dropped or duplicate data.

Lambda Architecture

One solution to this problem is the Lambda architecture, as described by Nathan Marz (formerly of BackType, Twitter). This architecture splits up the analytics calculation into a batch layer and a speed (streaming) layer. In the batch layer, as much of the calculation as possible is precomputed with batch processing over historical data. As the dataset grows, the calculation is constantly recomputed. The speed layer applies stream processing on only the most recent data. In this way, most of the processing takes place in batch jobs, in a platform such as Hadoop, while the complexity of stream processing is minimized to affect only the latest data. Errors or approximations that arise during stream processing are corrected when the data is later reprocessed in the batch layer. A serving layer combines the intermediate results from the batch and speed layers to produce the final analytics.

From Nathan Marz's presentation "Runaway Complexity in Big Data"

The robustness of this approach relies on two things: (1) immutable data and (2) continuous recomputation. That is, a master dataset of all the raw data is kept around indefinitely, and, as this dataset grows, calculations are constantly recomputed over the entire data set. Any failures, bugs, approximations, or missing features in the data processing are easily corrected because you are always recomputing everything from scratch. This architecture aligns well with the fundamental assumptions of big data: horizontally scalable data storage and computation. You can read more about the motivation behind this architecture in Nathan Marz's blog post. It is also described in detail in the book Big Data.

Lambda Architecture in Practice

In practice, you could run the batch layer on Hadoop, the speed (stream processing) layer on Storm, and store the results in a scalable database such as Cassandra. These intermediate results are read and merged by the serving layer.
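
As a toy illustration of the serving layer's job (the names and numbers here are made up), the final analytic is just the precomputed batch result plus the realtime increment:

```python
def serve_count(url, batch_counts, speed_counts):
    """Serving layer: merge the batch layer's precomputed result with
    the speed layer's realtime increment to produce the final analytic."""
    return batch_counts.get(url, 0) + speed_counts.get(url, 0)

# Batch layer: page-view counts recomputed from the master dataset up
# to the last batch run.  Speed layer: views that arrived since then.
batch_counts = {"/home": 1000, "/about": 40}
speed_counts = {"/home": 12}

total = serve_count("/home", batch_counts, speed_counts)   # 1012
```

When the next batch run completes, its output absorbs the speed layer's events, and the speed layer's state for that period can be discarded.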

Avoiding Code Duplication

Many applications will use the same processing logic in both the batch and speed layers. For example, to count the number of page views of a URL, in the batch layer you might implement a Hadoop Map function that adds one for each new page view, and in the speed layer write a Storm bolt that does the same thing. In a real-world application, this could mean implementing the same business logic twice: once for the batch layer and again for the speed layer. To avoid duplicating code, there are a couple of options: (1) adopting the Summingbird framework, or (2) running Spark and Spark Streaming.


Summingbird

Summingbird (a Twitter open source project) is a library that lets you write MapReduce programs in a Scala DSL, which can then be executed on different batch and stream processing platforms, such as Scalding (Hadoop) or Storm. It also provides a framework for the Lambda architecture. All you have to do is write your processing logic as a MapReduce program. Then you configure Summingbird, in its hybrid batch/realtime mode, to execute this program on Scalding for the batch layer and on Storm for the speed layer, and to merge the results into a final analytic.

Spark and Spark Streaming

Alternatively, instead of Hadoop/Storm, you could run the batch layer on Spark and the streaming layer on Spark Streaming (both open source projects from UC Berkeley). Because both platforms share the same Scala API, you would be able to reuse the same code for both the batch and speed layers. Unlike with Summingbird, however, you would need to implement the code to merge the intermediate results yourself.

Additional Info