Getting Started

Starting Point: SparkSession

The entry point into all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder:

frompyspark.sqlimportSparkSessionspark=SparkSession \ .builder \ .appName("Python Spark SQL basic example") \ .config("spark.some.config.option","some-value") \ .getOrCreate()

Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.

The entry point into all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder():

importorg.apache.spark.sql.SparkSessionvalspark=SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option","some-value").getOrCreate()

Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.

The entry point into all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder():

importorg.apache.spark.sql.SparkSession;SparkSessionspark=SparkSession.builder().appName("Java Spark SQL basic example").config("spark.some.config.option","some-value").getOrCreate();

Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.

The entry point into all functionality in Spark is the SparkSession class. To initialize a basic SparkSession, just call sparkR.session():

sparkR.session(appName="R Spark SQL basic example",sparkConfig=list(spark.some.config.option="some-value"))

Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.

Note that when invoked for the first time, sparkR.session() initializes a global SparkSession singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the SparkSession once, then SparkR functions like read.df will be able to access this global instance implicitly, and users don’t need to pass the SparkSession instance around.

SparkSession in Spark 2.0 provides builtin support for Hive features including the ability to write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables. To use these features, you do not need to have an existing Hive setup.

Creating DataFrames

With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.

As an example, the following creates a DataFrame based on the content of a JSON file:

# spark is an existing SparkSession df=spark.read.json("examples/src/main/resources/people.json")# Displays the content of the DataFrame to stdout df.show()# +----+-------+ # | age| name| # +----+-------+ # |null|Michael| # | 30| Andy| # | 19| Justin| # +----+-------+

Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.

With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.

As an example, the following creates a DataFrame based on the content of a JSON file:

valdf=spark.read.json("examples/src/main/resources/people.json")// Displays the content of the DataFrame to stdoutdf.show()// +----+-------+// | age| name|// +----+-------+// |null|Michael|// | 30| Andy|// | 19| Justin|// +----+-------+

Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.

With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.

As an example, the following creates a DataFrame based on the content of a JSON file:

importorg.apache.spark.sql.Dataset;importorg.apache.spark.sql.Row;Dataset<Row>df=spark.read().json("examples/src/main/resources/people.json");// Displays the content of the DataFrame to stdoutdf.show();// +----+-------+// | age| name|// +----+-------+// |null|Michael|// | 30| Andy|// | 19| Justin|// +----+-------+

Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.

With a SparkSession, applications can create DataFrames from a local R data.frame, from a Hive table, or from Spark data sources.

As an example, the following creates a DataFrame based on the content of a JSON file:

df<-read.json("examples/src/main/resources/people.json")# Displays the content of the DataFramehead(df)## age name## 1 NA Michael## 2 30 Andy## 3 19 Justin# Another method to print the first few rows and optionally truncate the printing of long valuesshowDF(df)## +----+-------+## | age| name|## +----+-------+## |null|Michael|## | 30| Andy|## | 19| Justin|## +----+-------+

Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.

Untyped Dataset Operations (aka DataFrame Operations)

DataFrames provide a domain-specific language for structured data manipulation in Scala, Java, Python and R.

As mentioned above, in Spark 2.0, DataFrames are just Dataset of Rows in Scala and Java API. These operations are also referred as “untyped transformations” in contrast to “typed transformations” come with strongly typed Scala/Java Datasets.

Here we include some basic examples of structured data processing using Datasets:

In Python, it’s possible to access a DataFrame’s columns either by attribute (df.age) or by indexing (df['age']). While the former is convenient for interactive data exploration, users are highly encouraged to use the latter form, which is future proof and won’t break with column names that are also attributes on the DataFrame class.

# spark, df are from the previous example # Print the schema in a tree format df.printSchema()# root # |-- age: long (nullable = true) # |-- name: string (nullable = true) # Select only the "name" column df.select("name").show()# +-------+ # | name| # +-------+ # |Michael| # | Andy| # | Justin| # +-------+ # Select everybody, but increment the age by 1 df.select(df['name'],df['age']+1).show()# +-------+---------+ # | name|(age + 1)| # +-------+---------+ # |Michael| null| # | Andy| 31| # | Justin| 20| # +-------+---------+ # Select people older than 21 df.filter(df['age']>21).show()# +---+----+ # |age|name| # +---+----+ # | 30|Andy| # +---+----+ # Count people by age df.groupBy("age").count().show()# +----+-----+ # | age|count| # +----+-----+ # | 19| 1| # |null| 1| # | 30| 1| # +----+-----+

Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.

For a complete list of the types of operations that can be performed on a DataFrame refer to the API Documentation.

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the DataFrame Function Reference.

// This import is needed to use the $-notationimportspark.implicits._// Print the schema in a tree formatdf.printSchema()// root// |-- age: long (nullable = true)// |-- name: string (nullable = true)// Select only the "name" columndf.select("name").show()// +-------+// | name|// +-------+// |Michael|// | Andy|// | Justin|// +-------+// Select everybody, but increment the age by 1df.select($"name",$"age"+1).show()// +-------+---------+// | name|(age + 1)|// +-------+---------+// |Michael| null|// | Andy| 31|// | Justin| 20|// +-------+---------+// Select people older than 21df.filter($"age">21).show()// +---+----+// |age|name|// +---+----+// | 30|Andy|// +---+----+// Count people by agedf.groupBy("age").count().show()// +----+-----+// | age|count|// +----+-----+// | 19| 1|// |null| 1|// | 30| 1|// +----+-----+

Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.

For a complete list of the types of operations that can be performed on a Dataset, refer to the API Documentation.

In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the DataFrame Function Reference.

// col("...") is preferable to df.col("...")importstaticorg.apache.spark.sql.functions.col;// Print the schema in a tree formatdf.printSchema();// root// |-- age: long (nullable = true)// |-- name: string (nullable = true)// Select only the "name" columndf.select("name").show();// +-------+// | name|// +-------+// |Michael|// | Andy|// | Justin|// +-------+// Select everybody, but increment the age by 1df.select(col("name"),col("age").plus(1)).show();// +-------+---------+// | name|(age + 1)|// +-------+---------+// |Michael| null|// | Andy| 31|// | Justin| 20|// +-------+---------+// Select people older than 21df.filter(col("age").gt(21)).show();// +---+----+// |age|name|// +---+----+// | 30|Andy|// +---+----+// Count people by agedf.groupBy("age").count().show();// +----+-----+// | age|count|// +----+-----+// | 19| 1|// |null| 1|// | 30| 1|// +----+-----+

Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.

For a complete list of the types of operations that can be performed on a Dataset refer to the API Documentation.

# Create the DataFramedf<-read.json("examples/src/main/resources/people.json")# Show the content of the DataFramehead(df)## age name## 1 NA Michael## 2 30 Andy## 3 19 Justin# Print the schema in a tree formatprintSchema(df)## root## |-- age: long (nullable = true)## |-- name: string (nullable = true)# Select only the "name" columnhead(select(df,"name"))## name## 1 Michael## 2 Andy## 3 Justin# Select everybody, but increment the age by 1head(select(df,df$name,df$age+1))## name (age + 1.0)## 1 Michael NA## 2 Andy 31## 3 Justin 20# Select people older than 21head(where(df,df$age>21))## age name## 1 30 Andy# Count people by agehead(count(groupBy(df,"age")))## age count## 1 19 1## 2 NA 1## 3 30 1

Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.

For a complete list of the types of operations that can be performed on a DataFrame refer to the API Documentation.

Running SQL Queries Programmatically

The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.

# Register the DataFrame as a SQL temporary view df.createOrReplaceTempView("people")sqlDF=spark.sql("SELECT * FROM people")sqlDF.show()# +----+-------+ # | age| name| # +----+-------+ # |null|Michael| # | 30| Andy| # | 19| Justin| # +----+-------+

Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.

The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.

// Register the DataFrame as a SQL temporary viewdf.createOrReplaceTempView("people")valsqlDF=spark.sql("SELECT * FROM people")sqlDF.show()// +----+-------+// | age| name|// +----+-------+// |null|Michael|// | 30| Andy|// | 19| Justin|// +----+-------+

Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.

The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a Dataset<Row>.

importorg.apache.spark.sql.Dataset;importorg.apache.spark.sql.Row;// Register the DataFrame as a SQL temporary viewdf.createOrReplaceTempView("people");Dataset<Row>sqlDF=spark.sql("SELECT * FROM people");sqlDF.show();// +----+-------+// | age| name|// +----+-------+// |null|Michael|// | 30| Andy|// | 19| Justin|// +----+-------+

Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.

The sql function enables applications to run SQL queries programmatically and returns the result as a SparkDataFrame.

df<-sql("SELECT * FROM table")

Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.

Global Temporary View

Temporary views in Spark SQL are session-scoped and will disappear if the session that creates it terminates. If you want to have a temporary view that is shared among all sessions and keep alive until the Spark application terminates, you can create a global temporary view. Global temporary view is tied to a system preserved database global_temp, and we must use the qualified name to refer it, e.g. SELECT * FROM global_temp.view1.

# Register the DataFrame as a global temporary view df.createGlobalTempView("people")# Global temporary view is tied to a system preserved database `global_temp` spark.sql("SELECT * FROM global_temp.people").show()# +----+-------+ # | age| name| # +----+-------+ # |null|Michael| # | 30| Andy| # | 19| Justin| # +----+-------+ # Global temporary view is cross-session spark.newSession().sql("SELECT * FROM global_temp.people").show()# +----+-------+ # | age| name| # +----+-------+ # |null|Michael| # | 30| Andy| # | 19| Justin| # +----+-------+

Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.

// Register the DataFrame as a global temporary viewdf.createGlobalTempView("people")// Global temporary view is tied to a system preserved database `global_temp`spark.sql("SELECT * FROM global_temp.people").show()// +----+-------+// | age| name|// +----+-------+// |null|Michael|// | 30| Andy|// | 19| Justin|// +----+-------+// Global temporary view is cross-sessionspark.newSession().sql("SELECT * FROM global_temp.people").show()// +----+-------+// | age| name|// +----+-------+// |null|Michael|// | 30| Andy|// | 19| Justin|// +----+-------+

Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.

// Register the DataFrame as a global temporary viewdf.createGlobalTempView("people");// Global temporary view is tied to a system preserved database `global_temp`spark.sql("SELECT * FROM global_temp.people").show();// +----+-------+// | age| name|// +----+-------+// |null|Michael|// | 30| Andy|// | 19| Justin|// +----+-------+// Global temporary view is cross-sessionspark.newSession().sql("SELECT * FROM global_temp.people").show();// +----+-------+// | age| name|// +----+-------+// |null|Michael|// | 30| Andy|// | 19| Justin|// +----+-------+

Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.

CREATEGLOBALTEMPORARYVIEWtemp_viewASSELECTa+1,b*2FROMtblSELECT*FROMglobal_temp.temp_view

Creating Datasets

Datasets are similar to RDDs, however, instead of using Java serialization or Kryo they use a specialized Encoder to serialize the objects for processing or transmitting over the network. While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code generated dynamically and use a format that allows Spark to perform many operations like filtering, sorting and hashing without deserializing the bytes back into an object.

caseclassPerson(name:String,age:Long)// Encoders are created for case classesvalcaseClassDS=Seq(Person("Andy",32)).toDS()caseClassDS.show()// +----+---+// |name|age|// +----+---+// |Andy| 32|// +----+---+// Encoders for most common types are automatically provided by importing spark.implicits._valprimitiveDS=Seq(1,2,3).toDS()primitiveDS.map(_+1).collect()// Returns: Array(2, 3, 4)// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by namevalpath="examples/src/main/resources/people.json"valpeopleDS=spark.read.json(path).as[Person]peopleDS.show()// +----+-------+// | age| name|// +----+-------+// |null|Michael|// | 30| Andy|// | 19| Justin|// +----+-------+

Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.

importjava.util.Arrays;importjava.util.Collections;importjava.io.Serializable;importorg.apache.spark.api.java.function.MapFunction;importorg.apache.spark.sql.Dataset;importorg.apache.spark.sql.Row;importorg.apache.spark.sql.Encoder;importorg.apache.spark.sql.Encoders;publicstaticclassPersonimplementsSerializable{privateStringname;privatelongage;publicStringgetName(){returnname;}publicvoidsetName(Stringname){this.name=name;}publiclonggetAge(){returnage;}publicvoidsetAge(longage){this.age=age;}}// Create an instance of a Bean classPersonperson=newPerson();person.setName("Andy");person.setAge(32);// Encoders are created for Java beansEncoder<Person>personEncoder=Encoders.bean(Person.class);Dataset<Person>javaBeanDS=spark.createDataset(Collections.singletonList(person),personEncoder);javaBeanDS.show();// +---+----+// |age|name|// +---+----+// | 32|Andy|// +---+----+// Encoders for most common types are provided in class EncodersEncoder<Long>longEncoder=Encoders.LONG();Dataset<Long>primitiveDS=spark.createDataset(Arrays.asList(1L,2L,3L),longEncoder);Dataset<Long>transformedDS=primitiveDS.map((MapFunction<Long,Long>)value->value+1L,longEncoder);transformedDS.collect();// Returns [2, 3, 4]// DataFrames can be converted to a Dataset by providing a class. Mapping based on nameStringpath="examples/src/main/resources/people.json";Dataset<Person>peopleDS=spark.read().json(path).as(personEncoder);peopleDS.show();// +----+-------+// | age| name|// +----+-------+// |null|Michael|// | 30| Andy|// | 19| Justin|// +----+-------+

Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.

Interoperating with RDDs

Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

The second method for creating Datasets is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct Datasets when the columns and their types are not known until runtime.

Inferring the Schema Using Reflection

Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table, and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.

frompyspark.sqlimportRowsc=spark.sparkContext# Load a text file and convert each line to a Row. lines=sc.textFile("examples/src/main/resources/people.txt")parts=lines.map(lambdal:l.split(","))people=parts.map(lambdap:Row(name=p[0],age=int(p[1])))# Infer the schema, and register the DataFrame as a table. schemaPeople=spark.createDataFrame(people)schemaPeople.createOrReplaceTempView("people")# SQL can be run over DataFrames that have been registered as a table. teenagers=spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")# The results of SQL queries are Dataframe objects. # rdd returns the content as an :class:`pyspark.RDD` of :class:`Row`. teenNames=teenagers.rdd.map(lambdap:"Name: "+p.name).collect()fornameinteenNames:print(name)# Name: Justin

Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns. Case classes can also be nested or contain complex types such as Seqs or Arrays. This RDD can be implicitly converted to a DataFrame and then be registered as a table. Tables can be used in subsequent SQL statements.

// For implicit conversions from RDDs to DataFramesimportspark.implicits._// Create an RDD of Person objects from a text file, convert it to a DataframevalpeopleDF=spark.sparkContext.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(attributes=>Person(attributes(0),attributes(1).trim.toInt)).toDF()// Register the DataFrame as a temporary viewpeopleDF.createOrReplaceTempView("people")// SQL statements can be run by using the sql methods provided by SparkvalteenagersDF=spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")// The columns of a row in the result can be accessed by field indexteenagersDF.map(teenager=>"Name: "+teenager(0)).show()// +------------+// | value|// +------------+// |Name: Justin|// +------------+// or by field nameteenagersDF.map(teenager=>"Name: "+teenager.getAs[String]("name")).show()// +------------+// | value|// +------------+// |Name: Justin|// +------------+// No pre-defined encoders for Dataset[Map[K,V]], define explicitlyimplicitvalmapEncoder=org.apache.spark.sql.Encoders.kryo[Map[String, Any]]// Primitive types and case classes can be also defined as// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]teenagersDF.map(teenager=>teenager.getValuesMap[Any](List("name","age"))).collect()// Array(Map("name" -> "Justin", "age" -> 19))

Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.

Spark SQL supports automatically converting an RDD of JavaBeans into a DataFrame. The BeanInfo, obtained using reflection, defines the schema of the table. Currently, Spark SQL does not support JavaBeans that contain Map field(s). Nested JavaBeans and List or Array fields are supported though. You can create a JavaBean by creating a class that implements Serializable and has getters and setters for all of its fields.

importorg.apache.spark.api.java.JavaRDD;importorg.apache.spark.api.java.function.Function;importorg.apache.spark.api.java.function.MapFunction;importorg.apache.spark.sql.Dataset;importorg.apache.spark.sql.Row;importorg.apache.spark.sql.Encoder;importorg.apache.spark.sql.Encoders;// Create an RDD of Person objects from a text fileJavaRDD<Person>peopleRDD=spark.read().textFile("examples/src/main/resources/people.txt").javaRDD().map(line->{String[]parts=line.split(",");Personperson=newPerson();person.setName(parts[0]);person.setAge(Integer.parseInt(parts[1].trim()));returnperson;});// Apply a schema to an RDD of JavaBeans to get a DataFrameDataset<Row>peopleDF=spark.createDataFrame(peopleRDD,Person.class);// Register the DataFrame as a temporary viewpeopleDF.createOrReplaceTempView("people");// SQL statements can be run by using the sql methods provided by sparkDataset<Row>teenagersDF=spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");// The columns of a row in the result can be accessed by field indexEncoder<String>stringEncoder=Encoders.STRING();Dataset<String>teenagerNamesByIndexDF=teenagersDF.map((MapFunction<Row,String>)row->"Name: "+row.getString(0),stringEncoder);teenagerNamesByIndexDF.show();// +------------+// | value|// +------------+// |Name: Justin|// +------------+// or by field nameDataset<String>teenagerNamesByFieldDF=teenagersDF.map((MapFunction<Row,String>)row->"Name: "+row.<String>getAs("name"),stringEncoder);teenagerNamesByFieldDF.show();// +------------+// | value|// +------------+// |Name: Justin|// +------------+

Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.

Programmatically Specifying the Schema

When a dictionary of kwargs cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps.

Create an RDD of tuples or lists from the original RDD;
Create the schema represented by a StructType matching the structure of tuples or lists in the RDD created in the step 1.
Apply the schema to the RDD via createDataFrame method provided by SparkSession.

For example:

# Import data types frompyspark.sql.typesimportStringType,StructType,StructFieldsc=spark.sparkContext# Load a text file and convert each line to a Row. lines=sc.textFile("examples/src/main/resources/people.txt")parts=lines.map(lambdal:l.split(","))# Each line is converted to a tuple. people=parts.map(lambdap:(p[0],p[1].strip()))# The schema is encoded in a string. schemaString="name age"fields=[StructField(field_name,StringType(),True)forfield_nameinschemaString.split()]schema=StructType(fields)# Apply the schema to the RDD. schemaPeople=spark.createDataFrame(people,schema)# Creates a temporary view using the DataFrame schemaPeople.createOrReplaceTempView("people")# SQL can be run over DataFrames that have been registered as a table. results=spark.sql("SELECT name FROM people")results.show()# +-------+ # | name| # +-------+ # |Michael| # | Andy| # | Justin| # +-------+

Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.

When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps.

Create an RDD of Rows from the original RDD;
Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
Apply the schema to the RDD of Rows via createDataFrame method provided by SparkSession.

For example:

importorg.apache.spark.sql.Rowimportorg.apache.spark.sql.types._// Create an RDDvalpeopleRDD=spark.sparkContext.textFile("examples/src/main/resources/people.txt")// The schema is encoded in a stringvalschemaString="name age"// Generate the schema based on the string of schemavalfields=schemaString.split(" ").map(fieldName=>StructField(fieldName,StringType,nullable=true))valschema=StructType(fields)// Convert records of the RDD (people) to RowsvalrowRDD=peopleRDD.map(_.split(",")).map(attributes=>Row(attributes(0),attributes(1).trim))// Apply the schema to the RDDvalpeopleDF=spark.createDataFrame(rowRDD,schema)// Creates a temporary view using the DataFramepeopleDF.createOrReplaceTempView("people")// SQL can be run over a temporary view created using DataFramesvalresults=spark.sql("SELECT name FROM people")// The results of SQL queries are DataFrames and support all the normal RDD operations// The columns of a row in the result can be accessed by field index or by field nameresults.map(attributes=>"Name: "+attributes(0)).show()// +-------------+// | value|// +-------------+// |Name: Michael|// | Name: Andy|// | Name: Justin|// +-------------+

Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.

When JavaBean classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a Dataset<Row> can be created programmatically with three steps.

Create an RDD of Rows from the original RDD;
Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
Apply the schema to the RDD of Rows via createDataFrame method provided by SparkSession.

For example:

importjava.util.ArrayList;importjava.util.List;importorg.apache.spark.api.java.JavaRDD;importorg.apache.spark.api.java.function.Function;importorg.apache.spark.sql.Dataset;importorg.apache.spark.sql.Row;importorg.apache.spark.sql.types.DataTypes;importorg.apache.spark.sql.types.StructField;importorg.apache.spark.sql.types.StructType;// Create an RDDJavaRDD<String>peopleRDD=spark.sparkContext().textFile("examples/src/main/resources/people.txt",1).toJavaRDD();// The schema is encoded in a stringStringschemaString="name age";// Generate the schema based on the string of schemaList<StructField>fields=newArrayList<>();for(StringfieldName:schemaString.split(" ")){StructFieldfield=DataTypes.createStructField(fieldName,DataTypes.StringType,true);fields.add(field);}StructTypeschema=DataTypes.createStructType(fields);// Convert records of the RDD (people) to RowsJavaRDD<Row>rowRDD=peopleRDD.map((Function<String,Row>)record->{String[]attributes=record.split(",");returnRowFactory.create(attributes[0],attributes[1].trim());});// Apply the schema to the RDDDataset<Row>peopleDataFrame=spark.createDataFrame(rowRDD,schema);// Creates a temporary view using the DataFramepeopleDataFrame.createOrReplaceTempView("people");// SQL can be run over a temporary view created using DataFramesDataset<Row>results=spark.sql("SELECT name FROM people");// The results of SQL queries are DataFrames and support all the normal RDD operations// The columns of a row in the result can be accessed by field index or by field nameDataset<String>namesDS=results.map((MapFunction<Row,String>)row->"Name: "+row.getString(0),Encoders.STRING());namesDS.show();// +-------------+// | value|// +-------------+// |Name: Michael|// | Name: Andy|// | Name: Justin|// +-------------+

Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.

Scalar Functions

Scalar functions are functions that return a single value per row, as opposed to aggregation functions, which return a value for a group of rows. Spark SQL supports a variety of Built-in Scalar Functions. It also supports User Defined Scalar Functions.

Aggregate Functions

Aggregate functions are functions that return a single value on a group of rows. The Built-in Aggregate Functions provide common aggregations such as count(), count_distinct(), avg(), max(), min(), etc. Users are not limited to the predefined aggregate functions and can create their own. For more details about user defined aggregate functions, please refer to the documentation of User Defined Aggregate Functions.

Spark SQL Guide

Getting Started

Starting Point: SparkSession

Creating DataFrames

Untyped Dataset Operations (aka DataFrame Operations)

Running SQL Queries Programmatically

Global Temporary View

Creating Datasets

Interoperating with RDDs

Inferring the Schema Using Reflection

Programmatically Specifying the Schema

Scalar Functions

Aggregate Functions