Sunday, May 12, 2013

Real-time Big Data Analytics with Solr

You have heard of Solr as a search engine, but did you know it can be used for real-time analytics as well? I first encountered this idea at a tutorial on Solr at Strata Conference. The ideal use case is when you want to provide real-time data ingestion and analytics on streaming data, with an all-in-one solution that is horizontally scalable and fault-tolerant. And, because of Lucene, which is built into Solr, you also get text-search based queries for free.

Example: Stock Tick

Stock tick data is an example of a fast data stream: every second stock price updates come in as different stocks are traded on an exchange. Suppose you wanted to build a dashboard that displays real-time updates of stock prices as well as other statistics, such as the moving average. You also want to support ad-hoc queries that analyze the historical performance of various stocks. Your denormalized data rows may look something like:

symbol:AAPL, time:2013-05-10 08:15:23, price:415.92, volume:900, ...

Real Time

Real-time ingestion is possible because as soon as a new data row comes in, it is immediately added and indexed in Solr. Internally, Lucene handles high-volume inserts to its indexes with a log-structured merge-tree, in which new, small index segments get merged as an index grows. To further support high insert rates, a Solr index can be split into shards across multiple nodes of a cluster. Real-time queries are answered with fast lookups on the index. Sharding also helps speed up large queries by spreading the query processing across multiple nodes.

Analytics

Solr supports many of the same queries as SQL: aggregation, boolean filters and range queries, grouping, and sorting (or, in SQL, COUNT/SUM/AVG, WHERE, GROUP BY, and ORDER BY). As a tradeoff of horizontal scalability, there are no Joins; you would need to denormalize the data to reframe those queries as aggregations. Because of its search roots, Solr also supports text search in queries (for text fields), custom scoring (ranking) of query results, and facets.

Scalable, Fault-tolerant

With SolrCloud, the support for distributed indexes in Solr, it is easy to grow a cluster horizontally. SolrCloud takes care of load balancing, replication, and automatic fail-over of shards. Internally, Zookeeper is used for distributed coordination.

Gotchas

One of the drawbacks of using an indexing system is with schema changes. Adding new fields is easy, but changing the datatype or (text) analyzer of an existing field requires a full reindex of the raw data, which can be time consuming. Another drawback is with storage of large text blocks. While text fields are well supported, Solr does not efficiently store large blocks of text, such as the full text of a news article or web page. In these cases, a separate NoSQL data store such as HBase/Cassandra may be brought in to provide that storage.

Related Systems

I only discussed Solr above, but there are a couple of other open-source, Lucene-based search engines that have similar analytics capabilities. Elasticsearch was built specifically with a focus on ease of distributed deployment, and has a JSON-based API. SenseiDB features a SQL-like query language and Hadoop integration.

More Info

Slides from Strata tutorial.