Subject: svn commit: r1666875 - /spark/site/docs/1.3.0/sql-programming-guide.html
Date: Mon, 16 Mar 2015 05:30:07 -0000
To: commits@spark.apache.org
From: pwendell@apache.org
X-Mailer: svnmailer-1.0.9
Message-Id: <20150316053007.8C306AC0397@hades.apache.org>

Author: pwendell
Date: Mon Mar 16 05:30:06 2015
New Revision: 1666875

URL: http://svn.apache.org/r1666875
Log:
Updating to incorporate doc changes in SPARK-6275 and SPARK-5310

Modified:
    spark/site/docs/1.3.0/sql-programming-guide.html

Modified: spark/site/docs/1.3.0/sql-programming-guide.html
URL: http://svn.apache.org/viewvc/spark/site/docs/1.3.0/sql-programming-guide.html?rev=1666875&r1=1666874&r2=1666875&view=diff
==============================================================================
--- spark/site/docs/1.3.0/sql-programming-guide.html (original)
+++ spark/site/docs/1.3.0/sql-programming-guide.html Mon Mar 16 05:30:06 2015
  • Overview
  • DataFrames

    All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell or the pyspark shell.

    Starting Point: SQLContext

    The entry point into all functionality in Spark SQL is the SQLContext class, or one of its descendants. To create a basic SQLContext, all you need is a SparkContext.

    val sc: SparkContext // An existing SparkContext.
     val sqlContext = new org.apache.spark.sql.SQLContext(sc)
     

    The entry point into all functionality in Spark SQL is the SQLContext class, or one of its descendants. To create a basic SQLContext, all you need is a SparkContext.

    JavaSparkContext sc = ...; // An existing JavaSparkContext.
     SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

    The entry point into all relational functionality in Spark is the SQLContext class, or one of its descendants. To create a basic SQLContext, all you need is a SparkContext.

    from pyspark.sql import SQLContext
     sqlContext = SQLContext(sc)

    In addition to the basic SQLContext, you can also create a HiveContext, which provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. To use a HiveContext, you do not need to have an existing Hive setup, and all of the data sources available to a SQLContext are still available. HiveContext is only packaged separately to avoid including all of Hive’s dependencies in the default Spark build. If these dependencies are not a problem for your application then using HiveContext is recommended for the 1.3 release of Spark. Future releases will focus on bringing SQLContext up to feature parity with a HiveContext.
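    For illustration, a minimal Scala sketch of creating a HiveContext, in the same style as the SQLContext example above:

    // sc is an existing SparkContext; a HiveContext accepts everything a SQLContext does.
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)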

    The specific variant of SQL that is used to parse queries can also be selected using the spark.sql.dialect option. This parameter can be changed using either the setConf method on a SQLContext or by using a SET key=value command in SQL. For a SQLContext, the only dialect available is “sql” which uses a simple SQL parser provided by Spark SQL. In a HiveContext, the default is “hiveql”, though “sql” is also available. Since the HiveQL parser is much more complete, this is recommended for most use cases.
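    As a rough sketch, the two ways of changing the dialect look like this (assuming the sqlContext from the examples above; the value shown is one of the dialect names from this paragraph):

    // Via setConf:
    sqlContext.setConf("spark.sql.dialect", "sql")

    // Or with a SQL command:
    sqlContext.sql("SET spark.sql.dialect=sql")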

    Scala:

    // Show the content of the DataFrame
    df.show()
    // age  name
    // null Michael
    // 30   Andy
    // 19   Justin

    // Print the schema in a tree format
    df.printSchema()

    // Select only the "name" column
    df.select("name").show()
    // name
    // Michael
    // Andy
    // Justin

    // Select everybody, but increment the age by 1
    df.select("name", df("age") + 1).show()
    // name    (age + 1)
    // Michael null
    // Andy    31
    // Justin  20

    // Select people older than 21
    df.filter(df("age") > 21).show()

    Java:

    // Show the content of the DataFrame
    df.show();
    // age  name
    // null Michael
    // 30   Andy
    // 19   Justin

    // Print the schema in a tree format
    df.printSchema();

    // Select only the "name" column
    df.select("name").show();
    // name
    // Michael
    // Andy
    // Justin

    // Select everybody, but increment the age by 1
    df.select("name", df.col("age").plus(1)).show();
    // name    (age + 1)
    // Michael null
    // Andy    31
    // Justin  20

    // Select people older than 21
    df.filter(df.col("age").gt(21)).show();

    Python:

    # Show the content of the DataFrame
    df.show()
    ## age  name
    ## null Michael
    ## 30   Andy
    ## 19   Justin

    # Print the schema in a tree format
    df.printSchema()

    # Select only the "name" column
    df.select("name").show()
    ## name
    ## Michael
    ## Andy
    ## Justin

    # Select everybody, but increment the age by 1
    df.select("name", df.age + 1).show()
    ## name    (age + 1)
    ## Michael null
    ## Andy    31
    ## Justin  20

    # Select people older than 21
    df.filter(df.age > 21).show()

    case class Person(name: String, age: Int)

    // Create an RDD of Person objects and register it as a table.
    val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
    people.registerTempTable("people")

    // SQL statements can be run by using the sql methods provided by sqlContext.

    Calling saveAsTable on a DataFrame materializes the contents of the dataframe and creates a pointer to the data in the HiveMetastore. Persistent tables will still exist even after your Spark program has restarted, as long as you maintain your connection to the same metastore. A DataFrame for a persistent table can be created by calling the table method on a SQLContext with the name of the table.

    By default saveAsTable will create a “managed table”, meaning that the location of the data will be controlled by the metastore. Managed tables will also have their data deleted automatically when the table is dropped.
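    A brief sketch of the round trip described above, assuming a HiveContext-backed sqlContext and an existing DataFrame df (both names are carried over from earlier examples):

    // Persist the DataFrame as a (managed) table in the metastore ...
    df.saveAsTable("people")
    // ... and later re-create a DataFrame from it by name.
    val people = sqlContext.table("people")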


    Partition discovery


    Table partitioning is a common optimization approach used in systems like Hive. In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in the path of each partition directory. The Parquet data source is now able to discover and infer partitioning information automatically. For example, we can store all our previously used population data into a partitioned table using the following directory structure, with two extra columns, gender and country as partitioning columns:

    path
    └── to
        └── table
            ├── gender=male
            │   ├── ...
            │   │
            │   ├── country=US
            │   │   └── data.parquet
            │   ├── country=CN
            │   │   └── data.parquet
            │   └── ...
            └── gender=female
                ├── ...
                │
                ├── country=US
                │   └── data.parquet
                ├── country=CN
                │   └── data.parquet
                └── ...

    By passing path/to/table to either SQLContext.parquetFile or SQLContext.load, Spark SQL will automatically extract the partitioning information from the paths. Now the schema of the returned DataFrame becomes:

    root
    |-- name: string (nullable = true)
    |-- age: long (nullable = true)
    |-- gender: string (nullable = true)
    |-- country: string (nullable = true)

    Notice that the data types of the partitioning columns are automatically inferred. Currently, numeric data types and string type are supported.
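    As an illustrative sketch, reading the layout above and filtering on a discovered partition column (the path and the data are the hypothetical ones from the directory listing):

    // gender and country are inferred from the directory names and behave like ordinary columns.
    val people = sqlContext.parquetFile("path/to/table")
    people.filter(people("country") === "US").show()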

    + +

    Schema merging


    Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.

    // sqlContext from the previous example is used in this example.
    // This is used to implicitly convert an RDD to a DataFrame.
    import sqlContext.implicits._

    // Create a simple DataFrame, stored into a partition directory
    val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
    df1.saveAsParquetFile("data/test_table/key=1")

    // Create another DataFrame in a new partition directory,
    // adding a new column and dropping an existing column
    val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
    df2.saveAsParquetFile("data/test_table/key=2")

    // Read the partitioned table
    val df3 = sqlContext.parquetFile("data/test_table")
    df3.printSchema()

    // The final schema consists of all 3 columns in the Parquet files together
    // with the partitioning column that appears in the partition directory paths.
    // root
    // |-- single: int (nullable = true)
    // |-- double: int (nullable = true)
    // |-- triple: int (nullable = true)
    // |-- key: int (nullable = true)
    # sqlContext from the previous example is used in this example.
    from pyspark.sql import Row

    # Create a simple DataFrame, stored into a partition directory
    df1 = sqlContext.createDataFrame(sc.parallelize(range(1, 6))\
                                       .map(lambda i: Row(single=i, double=i * 2)))
    df1.save("data/test_table/key=1", "parquet")

    # Create another DataFrame in a new partition directory,
    # adding a new column and dropping an existing column
    df2 = sqlContext.createDataFrame(sc.parallelize(range(6, 11))
                                       .map(lambda i: Row(single=i, triple=i * 3)))
    df2.save("data/test_table/key=2", "parquet")

    # Read the partitioned table
    df3 = sqlContext.parquetFile("data/test_table")
    df3.printSchema()

    # The final schema consists of all 3 columns in the Parquet files together
    # with the partitioning column that appears in the partition directory paths.
    # root
    # |-- single: int (nullable = true)
    # |-- double: int (nullable = true)
    # |-- triple: int (nullable = true)
    # |-- key: int (nullable = true)

    Configuration


    Configuration of Parquet can be done using the setConf method on SQLContext or by running SET key=value commands using SQL.
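    For example, a hedged sketch of both forms with one Parquet option (spark.sql.parquet.binaryAsString is used only to show the mechanism; consult the configuration table in the full guide for the supported keys):

    // Via setConf on a SQLContext:
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

    // Or as a SQL command:
    sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")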


    Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using one of two methods in a SQLContext:

    • jsonFile - loads data from a directory of JSON files where each line of the files is a JSON object.
    • jsonRDD - loads data from an existing RDD where each element of the RDD is a string containing a JSON object.
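    A short Scala sketch of the jsonFile path (people.json is assumed to be the sample file shipped under examples/src/main/resources in the Spark distribution):

    // sqlContext is an existing SQLContext; each line of people.json is a separate JSON object.
    val people = sqlContext.jsonFile("examples/src/main/resources/people.json")
    people.printSchema()
    people.registerTempTable("people")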

    Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using one of two methods in a SQLContext:

      • jsonFile - loads data from a directory of JSON files where each line of the files is a JSON object.
    • jsonRDD - loads data from an existing RDD where each element of the RDD is a string containing a JSON object.

    Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using one of two methods in a SQLContext:

        • jsonFile - loads data from a directory of JSON files where each line of the files is a JSON object.
    • jsonRDD - loads data from an existing RDD where each element of the RDD is a string containing a JSON object.

    # Register this DataFrame as a table.
    people.registerTempTable("people")

    # SQL statements can be run by using the sql methods provided by `sqlContext`.
    teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

    # Alternatively, a DataFrame can be created for a JSON dataset represented by an RDD of JSON strings.

    When working with Hive one must construct a HiveContext, which inherits from SQLContext, and adds support for finding tables in the MetaStore and writing queries using HiveQL. Users who do not have an existing Hive deployment can still create a HiveContext. When not configured by the hive-site.xml, the context automatically creates metastore_db and warehouse in the current directory.
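    A rough Scala sketch of the pattern described above (the table name is illustrative, and kv1.txt is assumed to be the sample key/value file from the Spark examples):

    // sc is an existing SparkContext.
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

    hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    hiveContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

    // Queries are expressed in HiveQL and return DataFrames.
    hiveContext.sql("FROM src SELECT key, value").collect().foreach(println)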


    Call sqlContext.uncacheTable("tableName") to remove the table from memory.


    Configuration of in-memory caching can be done using the setConf method on SQLContext or by running SET key=value commands using SQL.
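    A small sketch combining the caching calls with setConf (the batch-size key is an assumption used only to show the mechanism; see the configuration table in the full guide for the supported keys):

    // Optionally tune the columnar cache, then cache a registered table in memory.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
    sqlContext.cacheTable("tableName")

    // Later, release the memory.
    sqlContext.uncacheTable("tableName")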


    You may also use the beeline script that comes with Hive.


    The Thrift JDBC server also supports sending thrift RPC messages over HTTP transport. Use the following setting to enable HTTP mode as a system property or in the hive-site.xml file in conf/:

    hive.server2.transport.mode - Set this to value: http
    hive.server2.thrift.http.port - HTTP port number to listen on; default is 10001
    hive.server2.http.endpoint - HTTP endpoint; default is cliservice
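    For reference, a hedged sketch of connecting with beeline once HTTP mode is enabled (the host, port, and endpoint below are the defaults named above; the exact URL parameters depend on your Hive version):

    beeline> !connect jdbc:hive2://localhost:10001/default?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice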
     
    Spark 1.3 removes the type aliases that were present in the base sql package for DataType. Users should instead import the classes in org.apache.spark.sql.types.
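    A minimal Scala sketch of the new import location (the schema built here is purely illustrative):

    import org.apache.spark.sql.types._

    // DataType classes such as StructType, StructField and StringType now live in sql.types.
    val schema = StructType(Seq(StructField("name", StringType, nullable = true)))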


    UDF Registration Moved to sqlContext.udf (Java & Scala)

    Functions that are used to register UDFs, either for use in the DataFrame DSL or SQL, have been moved into the udf object in SQLContext.
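    A short sketch of the new registration path (the UDF name and body are illustrative, and the people table is assumed to be registered as in the earlier examples):

    // Register a Scala function as a UDF through sqlContext.udf, then use it from SQL.
    sqlContext.udf.register("strLen", (s: String) => s.length)
    sqlContext.sql("SELECT strLen(name) FROM people")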

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
    For additional commands, e-mail: commits-help@spark.apache.org