Friday, April 24, 2015

Spark for Exploratory Data Analysis?

Python and R have long been known for their data analysis packages and environments. But now that Spark supports DataFrames, will it be possible to do exploratory data analysis with Spark? Assuming the production system is implemented in Spark for scalability, it would be nice to do the initial data exploration within the same framework.

At first glance, all the major components are available. With Spark SQL, you can load data from a variety of sources, such as JSON, Hive, Parquet, and JDBC, and manipulate it with SQL. Since a DataFrame is backed by an RDD of rows with a schema, you can also process the data with the original RDD API, as well as with the algorithms and utilities in MLlib.
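For a concrete picture, here is a minimal PySpark sketch of what such a session might look like, assuming the Spark 1.3-era API; the events.json file and its user and amount fields are hypothetical stand-ins for a real dataset:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.mllib.stat import Statistics

    sc = SparkContext(appName="eda-sketch")
    sqlContext = SQLContext(sc)

    # Load JSON into a DataFrame; Spark SQL infers the schema.
    df = sqlContext.jsonFile("events.json")
    df.printSchema()

    # Manipulate the data with SQL by registering the DataFrame as a table.
    df.registerTempTable("events")
    top_users = sqlContext.sql(
        "SELECT user, COUNT(*) AS n, AVG(amount) AS avg_amount "
        "FROM events GROUP BY user ORDER BY n DESC LIMIT 10")
    top_users.show()

    # Drop down to the underlying RDD of Rows for the original RDD API...
    amounts = df.rdd.map(lambda row: row.amount)
    print(amounts.stats())  # count, mean, stdev, min, max

    # ...or hand columns to MLlib, e.g. for multivariate summary statistics.
    summary = Statistics.colStats(df.rdd.map(lambda row: [row.amount]))
    print(summary.mean(), summary.variance())

The same DataFrame flows from schema inference through SQL aggregation down to the RDD API and MLlib without ever leaving Spark, which is exactly the workflow an exploratory analysis would need.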

Of course, the details matter, so without having done a real-world project in this framework, I have to wonder: what is missing? Is there a critical data frame function in Pandas or R that is not yet supported in Spark? Are there other missing pieces that are critical to real-world data analysis? How difficult is it to patch up those missing pieces by linking in external libraries?