Xinh's Tech Blog: data munging

Showing posts with label data munging. Show all posts

Friday, July 29, 2016

Overview of Spark 2.0 Dataset / DataFrame API, Part 2

Introduction

In Part 1 of this series, we examined type-safe operations with Datasets. In Part 2, we will cover untyped operations with DataFrames. Being untyped, DataFrames are well-suited for data exploration, ad-hoc queries, and data munging.

DataFrame

DataFrames are still available in Spark 2.0, and remain mostly unchanged. The biggest change is that they have been merged with the new Dataset API. The DataFrame class no longer exists on its own; instead, it is defined as a specific type of Dataset: type DataFrame = Dataset[Row]. However, all of the functionality from 1.6 is still there.

Outline

Example Data
DataFrames: Untyped Language-Integrated SQL Queries
DataFrames: SQL Queries in SQL
DataFrames: Adding Columns, Data Munging

Example Data

We will continue with the example data from Part 1. We have defined a "Student" class as:

> case class Student(name: String, dept: String, age: Long)

The example data has been read into a Dataset[Student]:

> ds

ds: org.apache.spark.sql.Dataset[Student]
|age|dept| name|

+---+----+-----+ | 21|Math|Alice| | 23| CS| Bob| | 25|Math| Carl|

DataFrames: Untyped Language-Integrated SQL Queries

DataFrames supports language-integrated SQL queries, such as "select", "where", and "group by".

Convert to DataFrame

To convert a Dataset into a DataFrame:

> val df = ds.toDF()
df: org.apache.spark.sql.DataFrame

Select, Where

To select columns and specify a "where" clause:

> val selected = df.select("name", "age")
.where($"age" === 21)
selected: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
| name|age| +-----+---+ |Alice| 21|

Count

To count the number of rows:

> df.count()
res1: Long = 3

GroupBy, Aggregate

To perform a "group by" and aggregate within each group, use "groupBy" and "agg". A number of aggregation functions, such as "avg", are available in the "functions" object. To group by a column and compute the average in each group:

> import org.apache.spark.sql.functions._
> val avgAge2 = df.groupBy("dept")
.agg(avg($"age"))
|dept|avg(age)| +----+--------+ |Math| 23.0| | CS| 23.0|

Join

You can join two DataFrames with "join". To create a second DataFrame with department info:

> case class Department(deptName: String, building: Int)
> val depts = Seq(Department("Math", 125), Department("CS", 110)).toDF()
|deptName|building| +--------+--------+ | Math| 125| | CS| 110|

Then, to join the students DataFrame with the new department DataFrame:

> val joined2 = df.join(depts, df("dept") === depts("deptName"))
|age|dept| name|deptName|building| +---+----+-----+--------+--------+ | 21|Math|Alice| Math| 125| | 23| CS| Bob| CS| 110| | 25|Math| Carl| Math| 125|

Explain

To examine the query plan used to compute a DataFrame:

> joined2.explain()

== Physical Plan == *BroadcastHashJoin [dept#134], [deptName#384], Inner, BuildRight :- *Filter isnotnull(dept#134) : +- Scan ExistingRDD[age#133L,dept#134,name#135] +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false])) +- *Filter isnotnull(deptName#384) +- LocalTableScan [deptName#384, building#385]

DataFrames: SQL Queries in SQL

You can also query DataFrames with SQL. First create a temp view and then specify SQL queries against that view:

DataFrames: Adding Columns, Data Munging

DataFrames support creating new columns and data munging. To add a column, use "withColumn" to specify a new column name and an expression for column values. The Column class defines column operations, such as the minus operator shown below. The "functions" object also contains convenient functions for working with columns, such as math, string, and date / time functions.

In this example, the "lit" function, defined in "functions", returns a Column populated with a literal value. The Column class minus operator performs subtraction. The "$" method returns the Column associated with the given column name:

> import org.apache.spark.sql.functions._

> val withCol = df.withColumn("birthYear", lit(2016) - $"age")
|age|dept| name|birthYear| +---+----+-----+---------+ | 21|Math|Alice| 1995| | 23| CS| Bob| 1993| | 25|Math| Carl| 1991|

In the next example, the "round" function, defined in "functions", rounds values to the nearest tens digit:

> val rounded = df.withColumn("roundedAge", round($"age", -1))
|age|dept| name|roundedAge| +---+----+-----+----------+ | 21|Math|Alice| 20| | 23| CS| Bob| 20| | 25|Math| Carl| 30|

Summary

In Spark 2.0, DataFrames have been merged into the DataSet API. DataFrame is a special type of Dataset that has untyped operations. DataFrames support convenient ways to query data, either through language-integrated queries or SQL. DataFrames are also useful in creating new columns and data munging.

Friday, May 20, 2016

Overview of Spark DataFrame API

Introduction

Spark DataFrames were introduced in early 2015, in Spark 1.3. Since then, a lot of new functionality has been added in Spark 1.4, 1.5, and 1.6. More than a year later, Spark's DataFrame API provides a rich set of operations for data munging, SQL queries, and analytics. This post will give an overview of all the major features of Spark's DataFrame API, focusing on the Scala API in 1.6.1.

Outline

Classes and Objects
Reading Data
Traditional Dataframe Operations
Lazy Eval and collect()
SQL (Relational) Operations
Data Munging
Analytics

Classes and Objects

Let us start by reviewing the major classes and objects in the DataFrame API. The main ones are SQLContext, DataFrame, Column, and functions.

SQLContext, DataFrame, and Column Classes

SQLContext is the main entry point for creating DataFrames.
DataFrame is the main class representing the DataFrame data and operations.
The Column class represents an individual column of a DataFrame.

Functions Object

The functions object contains functions for aggregation, math, and date/time and string manipulation that can be applied on DataFrame columns.

Reading Data into a DataFrame

JSON, Parquet, JDBC, Hive, CSV

DataFrames can read from a large number of source data formats, such as JSON, Parquet, JDBC, and Hive. See the DataFrameReader class for some of the natively supported formats and Spark Packages for packages available for other formats, such as CSV and many others.

Reading JSON Example

Here is an example of reading JSON data into a DataFrame. The input file must contain one JSON object on each line:

> val df = sqlContext.read.json("/home/data.json")

df: org.apache.spark.sql.DataFrame = [col1: int, col2: int]

In the above example, sqlContext is of type SQLContext, its read() method returns a DataFrameReader, and the reader's json() method reads the specified data file.

Traditional Dataframe Operations

Spark DataFrames support traditional dataframe operations that you might expect from working with Pandas or R dataframes. You can select columns and rows, create new columns, and apply functions on column values.

Selecting Columns

To select one or more columns:

> df.select("col1")

|col1| +----+ | 1| | 2|

Selecting Rows

To select rows based on a boolean filter:

> df.filter(df("col1") > 1)

|col1|col2| +----+----+ | 2| 6|

In the above example, df("col1") is of type Column, and ">" is a method defined in the Column class. Alternatively, a column can be specified with the $"col1" syntax.

More methods in class Column: ===, !==, isNaN, isNull, isin, like, startsWith, endsWith

Creating New Columns

To create a new column derived from existing ones, use the withColumn() method:

> df.withColumn("col3", df("col1") + df("col2"))

|col1|col2|col3| +----+----+----+ | 1| 5| 6| | 2| 6| 8|

More methods in class Column: %, *, -, /, bitwiseAND, bitwiseOR, cast, &&, ||

Math Functions on Columns

A number of math functions, defined in the functions object, such as sqrt(), can be applied to column values:

> import org.apache.spark.sql.functions._

> df.select(df("col1"), sqrt(df("col1")))

|col1| SQRT(col1)| +----+------------------+ | 1| 1.0| | 2|1.4142135623730951|

You can also define your only functions on columns, via UDFs. See the "UDF" section below. Here are some more predefined math functions (not comprehensive) in functions:

cos, sin, tan, exp, log, pow, cbrt, hypot, toDegrees, toRadians, ceil, floor, round, rint, pmod, shiftLeft, shiftRight

Displaying Data and Schema

To display a DataFrame:

> df.show()

|col1|col2| +----+----+ | 1| 5| | 2| 6|

To display a DataFrame's column names and types:

> df.printSchema()

root |-- col1: integer (nullable = false) |-- col2: integer (nullable = false)

Lazy Eval and collect()

DataFrames are evaluated lazily, which means that no computation takes place until you perform an action. Any non-action method will thus return immediately, in most cases. An action is any method that produces output that is not a DataFrame, such as displaying data on the console, converting data into Scala Arrays, or saving data into a file or database.

To convert a DataFrame into an array, use the collect() method:

> df.collect()

res0: Array[org.apache.spark.sql.Row] = Array([1,5], [2,6])

To convert only the first n rows, use head or take.

SQL (Relational) Operations

DataFrames also support SQL (relational) operations, such as SELECT, WHERE, GROUP BY, Aggregate, and JOIN. You can also define UDFs (user-defined functions).

SELECT, WHERE

To do a SELECT with a WHERE clause:

> df.select("col1", "col2")

.where($"col1" === 1)

|col1|col2| +----+----+ | 1| 5|

Note that "===" is a Column method that tests for equality. More methods in class Column:
>, <, !==, isNaN, isNull, isin, like, startsWith, endsWith

GROUP BY, Aggregate

To do a GROUP BY and aggregation:

> df1.show()

|col1|col2| +----+----+ | 1| 5| | 2| 6| | 2| 7|

> import org.apache.spark.sql.functions._

> df1.groupBy("col1")

.agg(sum("col2").as("sum_col2"))

|col1|sum_col2| +----+--------+ | 1| 5| | 2| 13|

Note that the groupBy() method returns a GroupedData object, on which we call the agg() method to perform one or more aggregations. The sum() function is one of the aggregation functions defined in the functions object.

More aggregation functions in functions: approxCountDistinct, avg, corr, count, countDistinct, first, last, max, mean, min, skewness, stddev, sumDistinct, variance (and more)

JOIN

To do a JOIN between two DataFrames:

> people.show()

| id| name| +---+-----+ | 1|Alice| | 2| Bob|

> people.join(df, people("id") === df("col1"))

| id| name|col1|col2| +---+-----+----+----+ | 1|Alice| 1| 5| | 2| Bob| 2| 6|

The above example shows an inner join; other join types, such as outer join, are also supported.

More SQL operations

See also in class DataFrame:
alias, as, cube, distinct, drop, dropDuplicates, intersect, limit, na, orderBy, repartition, rollup, selectExpr, sort, unionAll, withColumnRenamed

UDFs

To define a UDF, use the udf function:

> import org.apache.spark.sql.functions.udf

val myUdf = udf {(n: Int) => (n * 2) + 1}

You can then apply the UDF on one or more Columns:

> df.select(df("col1"), myUdf(df("col1")))

|col1|UDF(col1)| +----+---------+ | 1| 3| | 2| 5|

Functions for Data Munging

There are a variety of functions to simplify data munging on date, timestamp, string, and nested data in DataFrames. These functions are defined in the functions object.

Dates and Timestamps

Here is an example of working with dates and timestamps. The date_add() function adds or subtracts days from a given date. The unix_timestamp() function returns a Unix timestamp corresponding to a timestamp string in a specified format:

> import org.apache.spark.sql.functions._

> df6.withColumn("day_before", date_add(df6("date"), -1))

.withColumn("unix_time", unix_timestamp(df6("date"), "yyyy-MM-dd"))

| date|day_before| unix_time| +----------+----------+----------+ |2016-01-01|2015-12-31|1451606400| |2016-09-05|2016-09-04|1473033600|

See also (not comprehensive) in functions: current_date, current_timestamp, date_sub, datediff, dayofmonth, dayofyear, from_unixtime, hour, last_day, minute, month, next_day, quarter, second, trunc, weekofyear, year

Strings

There are also a number of functions for working with strings. Here is one example, with the substring() function, which returns a substring given an input string, position, and length:

> df6.withColumn("month_day", substring(df6("date"), 6, 5))

| date|month_day|

+----------+---------+ |2016-01-01| 01-01| |2016-09-05| 09-05|

See also (not comprehensive) in functions: ascii, concat, decode, encode, format_number, format_string, length, levenshtein, lower, lpad, ltrim, regexp_extract, regexp_replace, repeat, rtrim, split, translate, trim, upper

Nested Data Structures

With certain data formats, such as JSON, it is common to have nested arrays and structs in the schema. The functions object includes functions for working with nested columns. For example, if a column is of type Array, such as "col2" below, you can use the explode() function to flatten the data inside that column:

> df8

|col1| col2| +----+--------+ | 1|[1a, 1b]| | 2| [2a]|

> df8.select(df8("col1"), explode(df8("col2")).as("col2_flat"))

|col1|col2_flat| +----+---------+ | 1| 1a| | 1| 1b| | 2| 2a|

The new flattened column, "col2_flat", can now be manipulated as an ordinary top-level column. For more about nested array data, please see my post on the topic.

Analytics

The DataFrame API includes functionality for analytics, namely, summary statistics, window functions, and pivot tables.

Summary Statistics

The describe() method computes summary statistics for numerical columns and is meant for exploratory data analysis:

> df.describe()

|summary| col1| col2| +-------+------------------+------------------+ | count| 2| 2| | mean| 1.5| 5.5| | stddev|0.7071067811865476|0.7071067811865476| | min| 1| 5| | max| 2| 6|

The stat() method returns a DataFrameStatFunctions object, which provides additional statistics functions such as:

corr, cov, crosstab, freqItems, sampleBy

Window Functions

Window functions allow you to perform calculations over a moving window of rows, and are the basis for calculating a moving average or cumulative sum. You can apply window functions on DataFrames by defining a WindowSpec:

> import org.apache.spark.sql.expressions.Window

> val wSpec2 = Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)

The above window spec for a moving average consists of three components: (1) partition by, (2) order by, and (3) a frame. To use this WindowSpec in a DataFrame, you would apply a window function or aggregation function, such as avg() over this WindowSpec:

> customers.withColumn("movingAvg", avg(customers("amountSpent")).over(wSpec2))

| name| date|amountSpent|movingAvg| +-----+----------+-----------+---------+ |Alice|2016-05-01| 50.0| 47.5| |Alice|2016-05-03| 45.0| 50.0| |Alice|2016-05-04| 55.0| 50.0| | Bob|2016-05-01| 25.0| 27.0| | Bob|2016-05-04| 29.0| 27.0| | Bob|2016-05-06| 27.0| 28.0|

For a list of all the possible window functions and aggregation functions, please see the functions object. For more examples of using window functions, please see my blog post on the topic.

For more window functions, see in functions: cume_dist, lag, lead, ntile, percent_rank, rank, row_number (and more)

Pivot Tables

You can create pivot tables with the pivot() method:

> df7.groupBy("col1").pivot("col2").avg("col3")

|col1| A| B| +----+----+----+ | 1|10.0|21.0| | 2|12.0|20.0|

In the above example, the DataFrame is grouped by col1, pivoted along col2, which contains the values "A" and "B", and computes the average of col3 in each group. The pivot() method is defined in the GroupedData class. For more information about pivoting, please see this Databricks article: Reshaping Data with Pivot in Apache Spark.

Summary

By now, you should have a good feel for what is possible with the Spark DataFrame Scala API. From data munging, to SQL, to analytics, this API provides a broad range of functionality for working with big data. For more information about DataFrames, see the Spark programming guide.

Thursday, April 28, 2016

Introduction to Spark for Developers and Data Scientists

What is Spark?

Spark is “a fast and general engine for large-scale data processing”. – http://spark.apache.org/

Spark is also one of the most popular open source frameworks for big data, based on number of contributors. Let us find out why this is the case.

When do you use Spark?

Suppose you would like to analyze a data set: perform ETL or data munging, then run SQL queries such as grouping and aggregations against the data, and maybe apply a machine learning algorithm. When the data size is small, everything will run quickly on a single machine, and you can use analysis tools like Pandas (Python), R, or Excel, or write your own scripts. But, for larger data sets, data processing will be too slow on a single machine, and then you will want to move to a cluster of machines. This is when you would use Spark.

You could probably benefit from Spark if:

Your data is currently stored in Hadoop / HDFS.
Your data set contains more than 100 million rows.
Ad-hoc queries take longer than 5 minutes to complete.

What Spark is Not, typical architecture

Spark can be a central component to a big data system, but it is not the only component. It is not a distributed file system: you would typically store your data on HDFS or S3. Nor is Spark a NoSQL database: Cassandra or HBase would be a better place for horizontally scalable table storage. And, it is not a message queue: you would use Kafka or Flume to collect streaming event data. Spark is, however, a compute engine which can take input or send output to all of these other systems.

How do you use Spark?

Implemented in Scala, Spark can be programmed in Scala, Java, Python, SQL, and R. However, not all of the latest functionality is immediately available in all languages.

What kind of operations does Spark support?

Spark SQL. Spark supports batch operations involved in ETL and data munging, via the DataFrame API. It supports parsing different input formats, such as JSON or Parquet. Once the raw data is loaded, you can easily compute new columns from existing columns. You can slice and dice the data by filtering, grouping, aggregating, and joining with other tables. Spark supports relational queries, which you can express in SQL or through the DataFrame API.

Spark Streaming. Spark also provides scalable stream processing. Given an input data stream, for example, coming from Kafka, Spark allows you to perform operations on the streaming data, such as map, reduce, join, and window.

MLlib. Spark includes a machine learning library, with scalable algorithms for classification, regression, collaborative filtering, clustering, and more. If training your dataset on a single machine takes too long, you might consider cluster computing with Spark.

GraphX. Finally, GraphX is a component in Spark for scalable batch processing on graphs.

As you may have noticed by now, Spark processing is batch oriented. It works best when you want to perform the same operation on all of your data, or a large subset of your data. Even with Spark Streaming, you operate on small batches of the data stream, rather than one event at a time.

Spark vs. Hadoop MapReduce, Hive, Impala

How does Spark compare with other big data compute engines? Unlike Hadoop MapReduce, Spark caches data in memory for huge performance gains when you have ad-hoc queries or iterative workloads, which are common in machine learning algorithms. Hive and Impala both run SQL queries at scale; the advantage of Spark over these systems is (1) the convenience of writing both queries and UDFs in the same language, such as Scala, and (2) support for machine learning algorithms, streaming data, and graph processing within the same system.

Conclusion

This overview has explained what Spark is, when to use it, what kinds of operations it supports, and how it compares with other big data systems. To learn more, please take a look at the Spark website (http://spark.apache.org/).