Introduction
Strata has been one of the best conferences for data science, and this year's conference did not disappoint. It brought together developers, data scientists, startups, and business people who are interested in "making data work". It was divided into seven tracks, including design and "Hadoop in Practice". Spending most of my time in the "Beyond Hadoop" and "Data Science" tracks, I noticed one of the themes this year was real-time data processing.
Tutorial: Search and Real Time Analytics (slides)
This was a really good tutorial presented by Ryan Tabora (Think Big Analytics) and Jason Rutherglen (DataStax). I learned that in addition to search, Solr has support for real-time analytics: the equivalents of sort and group by queries in SQL (you can't do joins, however). An example application would be ad-hoc queries on streaming stock tick data. The second half of the tutorial was an in-depth look at Lucene and some use cases (O'Reilly book coming soon). Rutherglen also talked about the DataStax Enterprise platform, which integrates Solr with Cassandra for scalability: Cassandra is the NoSQL data store for the raw data, and each Cassandra row maps to a Solr document.
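To make the "sort and group by" analogy concrete, here is a minimal plain-Python sketch of the kind of ad-hoc aggregation one might run over stock ticks. It does not use Solr at all, and it is not from the tutorial. The tick records, the field names, and the helper `volume_by_symbol` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical tick records; the symbols and field names are assumptions.
TICKS = [
    {"symbol": "AAA", "price": 10.0, "volume": 100},
    {"symbol": "BBB", "price": 20.0, "volume": 50},
    {"symbol": "AAA", "price": 10.5, "volume": 200},
    {"symbol": "CCC", "price": 5.0, "volume": 400},
    {"symbol": "BBB", "price": 19.5, "volume": 25},
]


def volume_by_symbol(ticks):
    """Sum volume per symbol (like GROUP BY symbol), largest total first."""
    totals = defaultdict(int)
    for tick in ticks:
        totals[tick["symbol"]] += tick["volume"]
    # Sort by total volume, descending (like ORDER BY total DESC).
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for symbol, total in volume_by_symbol(TICKS):
        print(symbol, total)
```

In SQL terms this is roughly `SELECT symbol, SUM(volume) ... GROUP BY symbol ORDER BY 2 DESC`. A search engine would answer the same question from its index instead of scanning every record.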
Tutorial: Core Data Science Skills (code & slides)
This was an interesting tutorial that introduced the basic methods and tools of supervised machine learning. It was led by William Cukierski and Ben Hamner, both from Kaggle. They talked about decision trees, random forests, and naive Bayes classifiers as the basic algorithms. They also demoed analyses in R with RStudio and in Python with IPython Notebook. The coolest part was the last hour, when all attendees practiced these skills by participating in a real Kaggle competition.
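For readers who have not seen one of these algorithms, here is a minimal sketch of a naive Bayes text classifier in plain Python. It is not the tutorial's code (they used R and Python libraries). The toy training data and the function names `train` and `predict` are my own assumptions, and I use add-one (Laplace) smoothing to handle unseen words.

```python
import math
from collections import Counter, defaultdict

# Toy labeled examples, invented purely for illustration.
EXAMPLES = [
    (["cheap", "pills", "now"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["cheap", "offer"], "spam"),
    (["lunch", "at", "noon"], "ham"),
]


def train(examples):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def predict(model, words):
    """Return the label with the highest log-probability score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in words:
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train(EXAMPLES)
    print(predict(model, ["cheap", "pills"]))
    print(predict(model, ["lunch", "noon"]))
```

The "naive" part is the assumption that words are independent given the label, which is why the per-word log-probabilities can simply be added.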
Keynotes
All of the keynote speakers were really good. I'll just highlight:
- Human Fault-tolerance (slides & video): Nathan Marz (Twitter) talked about the importance of immutability in distributed system design. I'm hoping to read more about it in his book on Big Data.
- Hidden Biases of Big Data (video): Kate Crawford (Microsoft Research) warned us that big data often does not tell the whole story, and that context and small data are also needed.
Sketching Techniques for Real-time Big Data (slides)
Bahman Bahmani (Stanford) explained that sketches of data are useful in streaming computation because they take up little memory, and allow for fast updates and queries. One example of a sketching data structure is a Bloom filter. Bahmani described sketches for fast approximate counting as well as on-the-fly PageRank computation.
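To show why a sketch takes so little memory, here is a minimal Bloom filter in plain Python. It is my own illustration, not code from the talk. The class name, the default sizes, and the choice of salted SHA-256 as the hash family are all assumptions. A Bloom filter can report false positives but never false negatives.

```python
import hashlib


class BloomFilter:
    """Approximate set membership using a fixed-size bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    seen = BloomFilter()
    for user in ["alice", "bob"]:
        seen.add(user)
    print("alice" in seen)  # added items are always reported as present
    print(len(seen.bits), "bytes used")
```

The memory use is fixed at `num_bits` regardless of how many items are added. The price is that the false-positive rate grows as the filter fills up.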
Sight, a Short Film (video)
This was an amazing short film depicting a futuristic augmented reality. It envisions a future that brings together many of the ideas from the conference: mobile, connected world, recommendations, gamification, and a ubiquitous Internet of Things.
Real Time Systems
There were a number of talks about ingesting and processing big data in real time. I'll cover them in a future post.