Xinh's Tech Blog: Spark for Exploratory Data Analysis?

Friday, April 24, 2015

Spark for Exploratory Data Analysis?

Python and R are well known for their data analysis packages and environments. But now that Spark supports DataFrames, is it possible to do exploratory data analysis with Spark? If the production system is implemented in Spark for scalability, it would be nice to do the initial data exploration within the same framework.

At first glance, all the major components are available. With Spark SQL, you can load data from a variety of sources, such as JSON, Hive, Parquet, and JDBC, and manipulate it with SQL. Since the data is stored in RDDs (with a schema), you can also process it with the original RDD APIs, as well as with the algorithms and utilities in MLlib.

Of course, the details matter, so without having done a real-world project in this framework, I have to wonder: what is missing? Is there a critical data frame function in Pandas or R that is not yet supported in Spark? Are there other missing pieces that are critical to real-world data analysis? How difficult is it to patch up those missing pieces by linking in external libraries?

Spark DataFrames: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
Spark SQL: https://spark.apache.org/docs/latest/sql-programming-guide.html
Spark MLlib: https://spark.apache.org/docs/latest/mllib-guide.html

5 comments:

Xinh Huynh, May 23, 2015 at 2:53 PM
One big part of exploratory data analysis is data visualization. For Spark, there is a notebook-style tool that provides it: Zeppelin, https://zeppelin.incubator.apache.org
Shivam Kumar, November 23, 2015 at 10:13 AM
This article would help: "read json file in java"
Imran Younus, December 6, 2015 at 2:42 PM
This article is identical to this one:

https://www.linkedin.com/pulse/spark-exploratory-data-analysis-gaurhari-dass
Unknown, July 19, 2018 at 3:18 AM
Well done! It is so well written and interactive. Keep writing such brilliant pieces of work. Glad I came across this post. Last night I also saw a similarly wonderful Data Science tutorial on YouTube, so you can check that too for more detailed knowledge on Data Science: https://www.youtube.com/watch?v=8gFu30KW-ek&t=270s