Introduction
Strata has been one of the best conferences for data science, and this year's conference did not disappoint. It brought together developers, data scientists, startups, and business people who are interested in "making data work". It was divided into seven tracks, including design and "Hadoop in Practice". Spending most of my time in the "Beyond Hadoop" and "Data Science" tracks, I noticed one of the themes this year was real-time data processing.
Tutorial: Search and Real Time Analytics (slides)
This was a really good tutorial presented by Ryan Tabora (Think Big Analytics) and Jason Rutherglen (DataStax). I learned that in addition to search, Solr supports real-time analytics: the equivalents of sort and group-by queries in SQL (you can't do joins, however). An example application would be ad-hoc queries on streaming stock tick data. The second half of the tutorial was an in-depth look at Lucene and some use cases (O'Reilly book coming soon). Rutherglen also talked about the DataStax Enterprise platform, which integrates Solr with Cassandra for scalability: Cassandra is the NoSQL data store for the raw data, and each Cassandra row maps to a Solr document.
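To make the "sort and group by" point concrete, here is a minimal sketch of what such a query against stock tick data might look like. It only builds the request URL and does not contact a server. The host, the core name `ticks`, and the field names `symbol` and `price` are my own assumptions for illustration, not something shown in the tutorial; `group`, `group.field`, and `sort` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint and core name, chosen only for illustration.
SOLR_SELECT_URL = "http://localhost:8983/solr/ticks/select"


def build_tick_query(symbol_field="symbol", price_field="price", rows=10):
    """Return a URL that groups tick documents by symbol and sorts by price."""
    params = {
        "q": "*:*",                     # match every document
        "group": "true",                # enable result grouping (like GROUP BY)
        "group.field": symbol_field,    # one group per distinct symbol
        "sort": f"{price_field} desc",  # order groups by price, highest first
        "rows": rows,                   # number of groups to return
        "wt": "json",                   # ask for a JSON response
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = build_tick_query()
print(url)
```

The resulting URL could be fetched with any HTTP client against a Solr instance that actually has such a core and fields.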
Tutorial: Core Data Science Skills (code & slides)
This was an interesting tutorial that introduced the basic methods and tools of supervised machine learning. It was led by William Cukierski and Ben Hamner, both from Kaggle. They covered decision trees, random forests, and naive Bayes classifiers as the basic algorithms, and they demoed analyses in R with RStudio and in Python with IPython Notebook. The coolest part was the last hour, when all attendees practiced these skills by participating in a real Kaggle competition.
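For readers who haven't seen one of these algorithms, here is a small sketch of a naive Bayes classifier for categorical features, written in plain Python. This is my own illustration, not the presenters' code (their demos used R and Python tooling), and the toy weather data and the function names `train_naive_bayes` and `predict` are made up for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Fit a categorical naive Bayes model from feature tuples and labels."""
    class_counts = Counter(labels)
    # value_counts[(feature_index, label)][value] = number of occurrences
    value_counts = defaultdict(Counter)
    # distinct values seen per feature, needed for add-one smoothing
    vocab = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab


def predict(model, row):
    """Return the label with the highest log posterior for one row."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior for this class
        for i, value in enumerate(row):
            # add-one smoothing so unseen values do not zero out the product
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(vocab[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training data: (outlook, windy) -> play outside?
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
        ("rainy", "no"), ("sunny", "no")]
labels = ["yes", "yes", "no", "no", "yes"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("sunny", "no")))
print(predict(model, ("rainy", "yes")))
```

The "naive" part is the inner loop: each feature contributes its own independent log-probability term, which keeps training down to simple counting.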
Keynotes
All of the keynote speakers were really good. I'll just highlight:
- Human Fault-tolerance (slides & video): Nathan Marz (Twitter) talked about the importance of immutability in distributed system design. I'm hoping to read more about it in his book on Big Data.
- Hidden Biases of Big Data (video): Kate Crawford (Microsoft Research) warned us that big data often does not tell the whole story, that context and small data are also needed.