Sunday, September 28, 2014

Higgs Data Challenge: Particle Physics meets Machine Learning

The HiggsML Challenge on Kaggle

How do you frame a particle physics problem as a data challenge? Will it be accessible to Kagglers? Will data scientists be able to outperform physicists with machine learning, as was the case in the NASA dark matter challenge? The Kaggle Higgs competition recently came to a close with great results.

The HiggsML challenge: when High Energy Physics meets Machine Learning.

Competition summary.

A Particle Physics Data Challenge

Goal: Classify collision events

The goal of the competition was to classify 550,000 proton-proton collision events as either originating from a Higgs boson decay (a "signal" event) or not (a "background" event). This type of analysis can be used in experiments at the Large Hadron Collider to discover decay channels of the Higgs boson.

Data: Tracks of particles produced in each collision

Each proton-proton collision, or event, produces a number of outgoing particles, whose tracks are recorded by a particle detector. The training data consisted of 250,000 labelled events, obtained from simulations of the collisions and detector. For each event, 30 features based on particle track data, such as particle mass, speed, and direction, were provided. The actual experimental data is highly unbalanced (much more background than signal), but the training data provided was fairly balanced.

Particle tracks: a Higgs boson decays into two τ particles, each of which subsequently decays into either an electron (blue line) or a muon (red line). ATLAS Experiment © 2014 CERN.

Evaluation Metric: Significance of discovery

Instead of a standard performance metric, submissions were evaluated by Approximate Median Significance (AMS), a measure of how well a classifier is able to discover new phenomena.

When a classifier selects events (in experimental data, not simulations) as "signal", the p-value is the probability that background processes alone would have produced a result at least that signal-like. The corresponding significance (Z) is the standard normal quantile of 1 − p, measured in units of sigma. We want a small p-value, and hence a large significance, before claiming a discovery (the convention in particle physics is Z ≥ 5). AMS is an estimate of this significance for a given classifier. For more details, see the Kaggle evaluation page.
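The AMS formula itself is compact. Here is a small Python implementation following the definition in the challenge documentation, where s and b are the weighted sums of selected signal and background events, and the regularization term b_reg was fixed at 10:

    import numpy as np

    def ams(s, b, b_reg=10.0):
        """Approximate Median Significance, as defined for the challenge.

        s: weighted sum of selected true signal events (true positives)
        b: weighted sum of selected background events (false positives)
        """
        return np.sqrt(2 * ((s + b + b_reg) * np.log(1 + s / (b + b_reg)) - s))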

Making Particle Physics Accessible

The Organizers

The contest organizers did a fantastic job of designing and organizing this competition. A group of physicists and machine learning researchers, they took a difficult subject, particle physics, and spent 18 months creating a data challenge that ultimately attracted 1,785 teams (a Kaggle record).

Their success can be attributed to the following:

  • Designing a dataset that could be analyzed without any domain knowledge. 
  • Providing multiple software starter kits that made it easy to get an analysis up and running. 
  • Providing a 16-page document explaining the challenge.
  • Answering questions promptly on the Kaggle discussion forum.

Not needed to win the contest. "Higgs-gluon-fusion" by TimothyRias. Licensed under Creative Commons Attribution-Share Alike 3.0.

The Results

Did machine learning give data scientists an edge? How did the physicists fare?

Cross validation was key

It was not machine learning but the statistical technique of cross validation that was key to winning this competition. The AMS metric had high variance, which made it difficult to know how well a model was really performing. The public leaderboard score could not be relied on, because it was calculated on only a small subset of the data. A rigorous cross validation (CV) scheme was therefore needed for model comparison and parameter tuning. The 1st-place finisher, Gabor Melis, used 2-fold stratified CV, repeated 35 times on random shuffles of the training data; on the forum, both the 1st- and 4th-place finishers cited cross validation as key to a high score.
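As a sketch of what such a scheme looks like in code, using scikit-learn and the ams function defined above (build_model, X, y, and weights are placeholders for your own model and the challenge arrays, not anything the winners published):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    def repeated_cv_ams(build_model, X, y, weights, n_repeats=35):
        """2-fold stratified CV, repeated on fresh shuffles, scored with AMS."""
        scores = []
        for seed in range(n_repeats):
            skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
            for train_idx, test_idx in skf.split(X, y):
                model = build_model()
                model.fit(X[train_idx], y[train_idx],
                          sample_weight=weights[train_idx])
                pred = model.predict(X[test_idx])
                # Weighted sums of selected signal (s) and background (b).
                s = weights[test_idx][(pred == 1) & (y[test_idx] == 1)].sum()
                b = weights[test_idx][(pred == 1) & (y[test_idx] == 0)].sum()
                # The challenge rescaled weights when scoring subsets; raw
                # sums are fine for comparing models against each other.
                scores.append(ams(s, b))
        return np.mean(scores), np.std(scores)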

Insufficient CV led to overfitting and to large shake-ups in the final ranking, as can be seen in a meta-analysis of the competition scores.

Overfitting led to shake-ups in the final ranking. "Overfitting svg" by Gringer. Licensed under Creative Commons Attribution 3.0

Feature engineering

Feature engineering was less important in this contest and had only a small impact on the score, because the organizers had already built strong features, such as those used in the actual Higgs discovery, into the provided dataset. Participants still came up with good features, such as one named "Cake", which discriminated between the signal and one major source of background events, the Z boson.
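As a flavor of what feature engineering looked like, here is a hypothetical (and much simpler) derived column built from two of the provided kinematic variables; this is only an illustration, not the "Cake" feature itself:

    import pandas as pd

    train = pd.read_csv("training.csv")
    # Hypothetical illustration: the ratio of the tau's transverse momentum
    # to the lepton's. PRI_tau_pt and PRI_lep_pt are real columns in the
    # challenge dataset; the ratio itself is just an example.
    train["pt_ratio"] = train["PRI_tau_pt"] / train["PRI_lep_pt"]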

Software and Algorithms

The software libraries and algorithms discussed on the forum were mostly standard ones: gradient tree boosting, bagging, Python's scikit-learn, and R. The 1st-place entry used a bag of 70 dropout neural networks (overview, model and code). Early in the competition, Tianqi Chen shared a fast gradient boosting library called XGBoost, which became popular among participants.
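As an illustration, a minimal XGBoost baseline on the challenge data might look like this (the hyperparameters are arbitrary placeholders, not a tuned recipe from any winning entry):

    import pandas as pd
    import xgboost as xgb

    # Load the challenge training file (250,000 labelled events: an EventId,
    # 30 features, a per-event Weight, and a Label that is 's' or 'b').
    train = pd.read_csv("training.csv")
    X = train.drop(columns=["EventId", "Weight", "Label"])
    y = (train["Label"] == "s").astype(int)

    # The dataset codes missing values as -999, which XGBoost handles natively.
    clf = xgb.XGBClassifier(n_estimators=300, max_depth=6,
                            learning_rate=0.1, missing=-999.0)
    clf.fit(X, y, sample_weight=train["Weight"])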

The 1st-place entry used a bag of 70 dropout neural networks. "Colored neural network" by Glosser.ca. Licensed under Creative Commons Attribution-Share Alike 3.0

Particle Physics meets Machine Learning

The organizers' overall goal was to increase the cross fertilization of ideas between the particle physics and machine learning communities. In that respect, this challenge was a good first step.

Physicists who participated began to appreciate machine learning techniques. This quote from the forum summarizes that sentiment: "The process of attempting as a physicist to compete against ML experts has given us a new respect for a field that (through ignorance) none of us held in as high esteem as we do now."

On the machine learning side, one researcher, Lester Mackey, developed a method to optimize AMS directly. In scikit-learn, support for sample weights was added to the GradientBoostingClassifier, in part due to interest from this competition.
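With that change, per-event weights can be passed directly into training. A minimal illustration on synthetic stand-in data (the real challenge files supply the weights):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    # Synthetic stand-in for the challenge data: 30 features, binary labels,
    # and arbitrary per-event weights.
    X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
    weights = np.random.RandomState(0).uniform(0.5, 2.0, size=len(y))

    clf = GradientBoostingClassifier()
    clf.fit(X, y, sample_weight=weights)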

Much more will be discussed at the upcoming NIPS workshop on particle physics and machine learning. There will also be a workshop on data challenges in machine learning.

There will be a workshop on particle physics and machine learning at the NIPS machine learning conference.

Personal Note

I studied physics at Caltech and at Stanford / SLAC (the Stanford Linear Accelerator Center) before switching to computer science, so I could not pass up the opportunity to participate in this challenge myself. It was my first Kaggle competition, and I finished in the top 10%.

Further Reading


  • The challenge website, with supplementary information not found on Kaggle: The HiggsML Challenge
  • Tim Salimans placed 2nd with a blend of boosted decision tree ensembles: details
  • Pierre Courtiol placed 3rd with an ensemble of neural networks: details
  • Lubos Motl, a theoretical physicist who ranked 8th, wrote several blog posts about the competition: initial post
  • "Log0" ranked 23rd by using scikit-learn AdaBoostClassifier and ExtraTreesClassifier: forum post
  • "phunter" ranked 26th with a single XGBoost model: What can a single model do?
  • Darin Baumgartel shared code based on scikit-learn GradientBoostingClassifier: post
  • Trevor Stephens, a data scientist, tried out hundreds of automatically generated features: Armchair Particle Physicist
  • Bios of participants who reached the top of the public leaderboard: Portraits
  • Balazs Kegl, one of the organizers, talked about the challenge in this video: CERN seminar


Comments:
  • The papers with the best solutions to the challenge are freely available here:
    http://jmlr.org/proceedings/papers/v42/