Friday, July 29, 2016

Overview of Spark 2.0 Dataset / DataFrame API, Part 1

Introduction

Spark 2.0 features a new Dataset API. Now that Datasets support a full range of operations, you can avoid working with low-level RDDs in most cases. In 2.0, DataFrame no longer exists as a separate class; instead, DataFrame is defined as a special case of Dataset. Here is some example code to get you started with Spark 2.0 Datasets / DataFrames. Part 1 focuses on Datasets, whose typed operations provide compile-time type safety. Part 2 focuses on DataFrames, whose operations are untyped.

Part 1: Datasets: Type-safe operations. (This blog post)
Part 2: DataFrame: Untyped operations. (Next blog post)

Dataset vs. DataFrame

A Dataset[T] is a parameterized type, where the type T is specified by the user and is associated with each element of the Dataset. A DataFrame, on the other hand, has no explicit type associated with it at compile time, from the user's point of view. Internally, a DataFrame is defined as a Dataset[Row], where Row is a generic row type defined by Spark SQL.
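
For example (a minimal sketch, assuming the spark-shell with spark.implicits._ in scope and the Student case class defined in the next section; the val names are illustrative), converting a typed Dataset to a DataFrame simply forgets the element type:

val typedDs = Seq(Student("Alice", "Math", 21)).toDS() // Dataset[Student]
val untypedDf = typedDs.toDF()                         // DataFrame, i.e. Dataset[Row]
untypedDf.first()                                      // returns a generic Row, not a Student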

Language

This blog post refers to the Scala API.

Outline

  • Reading Data In
  • Data Exploration
  • Statistics
  • Functional Transformations
  • Caching
  • Getting Data Out

Reading Data In

Spark supports a number of input formats, including Hive, JDBC, Parquet, CSV, and JSON. Below is an example of reading JSON data into a Dataset.

JSON example

Suppose you have this example JSON data, with one object per line:
{"name":"Alice", "dept":"Math", "age":21}
{"name":"Bob", "dept":"CS", "age":23}
{"name":"Carl", "dept":"Math", "age":25}

To read a JSON data file, first use the SparkSession object as an entry point, and access its DataFrameReader to read data into a DataFrame:

> val df = spark.read.json("/path/to/file.json") // "spark" is a SparkSession object
df: org.apache.spark.sql.DataFrame

Then convert the DataFrame into Dataset[Student]:

case class Student(name: String, dept: String, age: Long)
val ds = df.as[Student]
ds: org.apache.spark.sql.Dataset[Student]
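
The same DataFrameReader handles other formats in much the same way. Below is a minimal sketch for CSV and Parquet (the paths are placeholders; outside the spark-shell you would also need "import spark.implicits._" for the ".as[Student]" conversion):

// CSV with a header row, letting Spark infer the column types
val csvDf = spark.read.option("header", "true").option("inferSchema", "true").csv("/path/to/file.csv")

// Parquet files carry their own schema
val parquetDf = spark.read.parquet("/path/to/file.parquet")

// Convert to a typed Dataset once the schema matches a case class
val csvDs = csvDf.as[Student]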

Data Exploration

When you first look into a new data set, you can explore its contents by printing out the schema, counting the number of rows, and displaying some of those rows.

Print Schema

To explore what is in this Dataset, you can print out the schema:
> ds.printSchema()
root
 |-- age: long (nullable = true)
 |-- dept: string (nullable = true)
 |-- name: string (nullable = true)

Count Rows

To count the number of rows:
> ds.count()
res2: Long = 3

Display Rows

To display the first few rows in tabular format:
> ds.show()
+---+----+-----+
|age|dept| name|
+---+----+-----+
| 21|Math|Alice|
| 23|  CS|  Bob|
| 25|Math| Carl|
+---+----+-----+

Sample Rows

To get a sample of the data:

val sample = ds.sample(withReplacement=false, fraction=0.3)
sample: org.apache.spark.sql.Dataset[Student]
+---+----+----+
|age|dept|name|
+---+----+----+
| 25|Math|Carl|
+---+----+----+

Statistics

A number of statistics functions are available for Datasets.

Summary Statistics

To get summary statistics on numerical fields, call "describe":

val summary = ds.describe()
summary: org.apache.spark.sql.DataFrame
+-------+----+
|summary| age|
+-------+----+
|  count|   3|
|   mean|23.0|
| stddev| 2.0|
|    min|  21|
|    max|  25|
+-------+----+

Additional Statistical Functions, Approximate Frequent Items

The "stat" method returns a DataFrameStatFunctions object for statistical functions:

> ds.stat
res11: org.apache.spark.sql.DataFrameStatFunctions

For example, "stat.freqItems" returns approximate frequent items for the given columns:

val approxFreqItems = ds.stat.freqItems(Seq("dept"))
approxFreqItems: org.apache.spark.sql.DataFrame
+--------------+
|dept_freqItems|
+--------------+
|    [CS, Math]|
+--------------+

Functional Transformations

The Dataset API supports functional transformations, such as "filter" and "map", much like the RDD API. These operators transform one Dataset[T] into another Dataset[U], where T and U are user-specified types. These operations have compile-time type safety, in the sense that each row is associated with a Scala object of a fixed type T (or U). This is in contrast to DataFrames, which are untyped. "Reduce" is an action that reduces the elements of a Dataset into a scalar value.

Filter

To filter for rows that satisfy a given predicate:

> val youngStudents = ds.filter($"age" < 22)
youngStudents: org.apache.spark.sql.Dataset[Student]
+---+----+-----+
|age|dept| name|
+---+----+-----+
| 21|Math|Alice|
+---+----+-----+
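
Because each element is a Student, "filter" also accepts a plain Scala predicate, so the field name and type are checked at compile time. Here is an equivalent sketch (the val name is illustrative):

val youngStudents2 = ds.filter(_.age < 22)
youngStudents2: org.apache.spark.sql.Dataset[Student]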

Map

To map over rows with a given lambda function: 

val names = ds.map{_.name}
names: org.apache.spark.sql.Dataset[String]
+-----+
|value|
+-----+
|Alice|
|  Bob|
| Carl|
+-----+

Reduce

To reduce the elements of a Dataset with a given reducer function:

val totalAge = ds.map(_.age).reduce(_ + _)
totalAge: Long = 69

Join

You can join two Datasets. Suppose you want to join the "Students" Dataset with a new "Department" Dataset:

case class Department(name: String, building: Int)
val depts = Seq(Department("Math", 125), Department("CS", 110)).toDS()
+----+--------+
|name|building|
+----+--------+
|Math|     125|
|  CS|     110|
+----+--------+

To join the Students" Dataset with the new "Department" Dataset:

val joined = ds.joinWith(depts, ds("dept") === depts("name"))
joined: org.apache.spark.sql.Dataset[(Student, Department)]
+---------------+----------+
|             _1|        _2|
+---------------+----------+
|[21,Math,Alice]|[Math,125]|
|    [23,CS,Bob]|  [CS,110]|
| [25,Math,Carl]|[Math,125]|
+---------------+----------+

GroupByKey, Aggregation

To group elements of a Dataset and aggregate within each group:

val deptSizes = ds.groupByKey(_.dept).count()
deptSizes: org.apache.spark.sql.Dataset[(String, Long)]
+-----+--------+
|value|count(1)|
+-----+--------+
| Math|       2|
|   CS|       1|
+-----+--------+

Additional aggregation functions are available in the "functions" object. The "avg" function calculates an average for each group:

import org.apache.spark.sql.functions._
val avgAge = ds.groupByKey(_.dept)
  .agg(avg($"age").as[Double])
avgAge: org.apache.spark.sql.Dataset[(String, Double)]
+-----+--------+
|value|avg(age)|
+-----+--------+
| Math|    23.0|
|   CS|    23.0|
+-----+--------+
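
You can also compute several aggregates per group in one call by passing multiple typed columns to "agg"; a small sketch (the val name is illustrative, and spark.implicits._ is assumed to be in scope):

val deptStats = ds.groupByKey(_.dept)
  .agg(avg($"age").as[Double], count($"age").as[Long])
deptStats: org.apache.spark.sql.Dataset[(String, Double, Long)]
// e.g. ("Math", 23.0, 2) and ("CS", 23.0, 1) for the sample data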

OrderBy

To order by a given set of fields:

val ordered = ds.orderBy("dept", "name")
ordered: org.apache.spark.sql.Dataset[Student]
+---+----+-----+
|age|dept| name|
+---+----+-----+
| 23|  CS|  Bob|
| 21|Math|Alice|
| 25|Math| Carl|
+---+----+-----+

Caching

To persist a Dataset at the default storage level (memory and disk):
> ds.cache()
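
If the default level is not what you want, "persist" accepts an explicit StorageLevel, and "unpersist" releases the cached data; a minimal sketch:

import org.apache.spark.storage.StorageLevel
ds.persist(StorageLevel.MEMORY_ONLY) // keep deserialized objects in memory only
ds.unpersist()                       // drop the cached copy when it is no longer needed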

Getting Data Out

Into an Array

To collect data into a Scala Array, use "collect". Note that this pulls all rows onto the driver node, so it can be a memory- and I/O-intensive operation.

> val studentArr = ds.collect()
studentArr: Array[Student] = Array(Student(Alice,Math,21), Student(Bob,CS,23), Student(Carl,Math,25))

To collect only the first few rows into a Scala Array:
val firstTwo = ds.head(2)
firstTwo: Array[Student] = Array(Student(Alice,Math,21), Student(Bob,CS,23))

Into an RDD

To convert into an RDD:

val studentRdd = ds.rdd
studentRdd: org.apache.spark.rdd.RDD[Student]
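
The conversion also works in the other direction; a minimal sketch (assuming "import spark.implicits._" so that an Encoder[Student] is in scope; the val names are illustrative):

val backToDs = spark.createDataset(studentRdd)
backToDs: org.apache.spark.sql.Dataset[Student]
// or, using the implicit conversion:
val backToDs2 = studentRdd.toDS()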

Into a File

To write a Dataset into a file, use "write". A number of output formats are supported. Here is an example of writing in JSON format:

> ds.write.json("/path/to/file.json")
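
Other formats and save modes follow the same pattern; for example, a small sketch that writes Parquet and overwrites any existing output at the (placeholder) path:

> ds.write.mode("overwrite").parquet("/path/to/output.parquet")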

Continuation

To read about untyped operations with DataFrames, continue on to Part 2.
