Subject: svn commit: r1666875 - /spark/site/docs/1.3.0/sql-programming-guide.html
Date: Mon, 16 Mar 2015 05:30:07 -0000
To: commits@spark.apache.org
From: pwendell@apache.org
X-Mailer: svnmailer-1.0.9
Message-Id: <20150316053007.8C306AC0397@hades.apache.org>

Author: pwendell
Date: Mon Mar 16 05:30:06 2015
New Revision: 1666875

URL: http://svn.apache.org/r1666875
Log:
Updating to incorporate doc changes in SPARK-6275 and SPARK-5310

Modified:
    spark/site/docs/1.3.0/sql-programming-guide.html

Modified: spark/site/docs/1.3.0/sql-programming-guide.html
URL: http://svn.apache.org/viewvc/spark/site/docs/1.3.0/sql-programming-guide.html?rev=1666875&r1=1666874&r2=1666875&view=diff
==============================================================================
--- spark/site/docs/1.3.0/sql-programming-guide.html (original)
+++ spark/site/docs/1.3.0/sql-programming-guide.html Mon Mar 16 05:30:06 2015
  • Overview
  • DataFrames

    All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell or the pyspark shell.

    Starting Point: SQLContext

    The entry point into all functionality in Spark SQL is the SQLContext class, or one of its descendants. To create a basic SQLContext, all you need is a SparkContext.

    val sc: SparkContext // An existing SparkContext.
     val sqlContext = new org.apache.spark.sql.SQLContext(sc)
     

    The entry point into all functionality in Spark SQL is the SQLContext class, or one of its descendants. To create a basic SQLContext, all you need is a SparkContext.

    JavaSparkContext sc = ...; // An existing JavaSparkContext.
     SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

    The entry point into all relational functionality in Spark is the SQLContext class, or one of its descendants. To create a basic SQLContext, all you need is a SparkContext.

    from pyspark.sql import SQLContext
     sqlContext = SQLContext(sc)

    In addition to the basic SQLContext, you can also create a HiveContext, which provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. To use a HiveContext, you do not need to have an existing Hive setup, and all of the data sources available to a SQLContext are still available. HiveContext is only packaged separately to avoid including all of Hive’s dependencies in the default Spark build. If these dependencies are not a problem for your application then using HiveContext is recommended for the 1.3 release of Spark. Future releases will focus on bringing SQLContext up to feature parity with a HiveContext.
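    For illustration, a minimal Scala sketch of creating a HiveContext, in the same style as the SQLContext example above:

    // sc is an existing SparkContext; a HiveContext accepts everything a SQLContext does.
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)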

    The specific variant of SQL that is used to parse queries can also be selected using the spark.sql.dialect option. This parameter can be changed using either the setConf method on a SQLContext or by using a SET key=value command in SQL. For a SQLContext, the only dialect available is “sql” which uses a simple SQL parser provided by Spark SQL. In a HiveContext, the default is “hiveql”, though “sql” is also available. Since the HiveQL parser is much more complete, this is recommended for most use cases.
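    As a rough sketch, the two ways of changing the dialect look like this (assuming the sqlContext from the examples above; the value shown is one of the dialect names from this paragraph):

    // Via setConf:
    sqlContext.setConf("spark.sql.dialect", "sql")

    // Or with a SQL command:
    sqlContext.sql("SET spark.sql.dialect=sql")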

    Scala:

    // Show the content of the DataFrame
    df.show()
    // age  name
    // null Michael
    // 30   Andy
    // 19   Justin

    // Print the schema in a tree format
    df.printSchema()

    // Select only the "name" column
    df.select("name").show()
    // name
    // Michael
    // Andy
    // Justin

    // Select everybody, but increment the age by 1
    df.select("name", df("age") + 1).show()
    // name    (age + 1)
    // Michael null
    // Andy    31
    // Justin  20

    // Select people older than 21
    df.filter(df("age") > 21).show()

    Java:

    // Show the content of the DataFrame
    df.show();
    // age  name
    // null Michael
    // 30   Andy
    // 19   Justin

    // Print the schema in a tree format
    df.printSchema();

    // Select only the "name" column
    df.select("name").show();
    // name
    // Michael
    // Andy
    // Justin

    // Select everybody, but increment the age by 1
    df.select("name", df.col("age").plus(1)).show();
    // name    (age + 1)
    // Michael null
    // Andy    31
    // Justin  20

    // Select people older than 21
    df.filter(df.col("age").gt(21)).show();

    Python:

    # Show the content of the DataFrame
    df.show()
    ## age  name
    ## null Michael
    ## 30   Andy
    ## 19   Justin

    # Print the schema in a tree format
    df.printSchema()

    # Select only the "name" column
    df.select("name").show()
    ## name
    ## Michael
    ## Andy
    ## Justin

    # Select everybody, but increment the age by 1
    df.select("name", df.age + 1).show()
    ## name    (age + 1)
    ## Michael null
    ## Andy    31
    ## Justin  20

    # Select people older than 21
    df.filter(df.age > 21).show()

    case class Person(name: String, age: Int)

    // Create an RDD of Person objects and register it as a table.
    val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
    people.registerTempTable("people")

    // SQL statements can be run by using the sql methods provided by sqlContext.

    Calling saveAsTable on a DataFrame materializes the contents of the dataframe and creates a pointer to the data in the HiveMetastore. Persistent tables will still exist even after your Spark program has restarted, as long as you maintain your connection to the same metastore. A DataFrame for a persistent table can be created by calling the table method on a SQLContext with the name of the table.

    By default saveAsTable will create a “managed table”, meaning that the location of the data will be controlled by the metastore. Managed tables will also have their data deleted automatically when the table is dropped.
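    A brief sketch of the round trip described above, assuming a HiveContext-backed sqlContext and an existing DataFrame df (both names are carried over from earlier examples):

    // Persist the DataFrame as a (managed) table in the metastore ...
    df.saveAsTable("people")
    // ... and later re-create a DataFrame from it by name.
    val people = sqlContext.table("people")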


    Partition discovery


    Table partitioning is a common optimization approach used in systems like Hive. In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in the path of each partition directory. The Parquet data source is now able to discover and infer partitioning information automatically. For example, we can store all our previously used population data into a partitioned table using the following directory structure, with two extra columns, gender and country as partitioning columns:

    path
    └── to
        └── table
            ├── gender=male
            │   ├── ...
            │   │
            │   ├── country=US
            │   │   └── data.parquet
            │   ├── country=CN
            │   │   └── data.parquet
            │   └── ...
            └── gender=female
                ├── ...
                │
                ├── country=US
                │   └── data.parquet
                ├── country=CN
                │   └── data.parquet
                └── ...

    By passing path/to/table to either SQLContext.parquetFile or SQLContext.load, Spark SQL will automatically extract the partitioning information from the paths. Now the schema of the returned DataFrame becomes:

    root
    |-- name: string (nullable = true)
    |-- age: long (nullable = true)
    |-- gender: string (nullable = true)
    |-- country: string (nullable = true)

    Notice that the data types of the partitioning columns are automatically inferred. Currently, numeric data types and string type are supported.
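    As an illustrative sketch, reading the layout above and filtering on a discovered partition column (the path and the data are the hypothetical ones from the directory listing):

    // gender and country are inferred from the directory names and behave like ordinary columns.
    val people = sqlContext.parquetFile("path/to/table")
    people.filter(people("country") === "US").show()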

    + +

    Schema merging


    Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.

    // sqlContext from the previous example is used in this example.
    // This is used to implicitly convert an RDD to a DataFrame.
    import sqlContext.implicits._

    // Create a simple DataFrame, stored into a partition directory
    val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
    df1.saveAsParquetFile("data/test_table/key=1")

    // Create another DataFrame in a new partition directory,
    // adding a new column and dropping an existing column
    val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
    df2.saveAsParquetFile("data/test_table/key=2")

    // Read the partitioned table
    val df3 = sqlContext.parquetFile("data/test_table")
    df3.printSchema()

    // The final schema consists of all 3 columns in the Parquet files together
    // with the partitioning column that appears in the partition directory paths.
    // root
    // |-- single: int (nullable = true)
    // |-- double: int (nullable = true)
    // |-- triple: int (nullable = true)
    // |-- key: int (nullable = true)
    # sqlContext from the previous example is used in this example.
    from pyspark.sql import Row

    # Create a simple DataFrame, stored into a partition directory
    df1 = sqlContext.createDataFrame(sc.parallelize(range(1, 6))\
                                       .map(lambda i: Row(single=i, double=i * 2)))
    df1.save("data/test_table/key=1", "parquet")

    # Create another DataFrame in a new partition directory,
    # adding a new column and dropping an existing column
    df2 = sqlContext.createDataFrame(sc.parallelize(range(6, 11))
                                       .map(lambda i: Row(single=i, triple=i * 3)))
    df2.save("data/test_table/key=2", "parquet")

    # Read the partitioned table
    df3 = sqlContext.parquetFile("data/test_table")
    df3.printSchema()

    # The final schema consists of all 3 columns in the Parquet files together
    # with the partitioning column that appears in the partition directory paths.
    # root
    # |-- single: int (nullable = true)
    # |-- double: int (nullable = true)
    # |-- triple: int (nullable = true)
    # |-- key: int (nullable = true)

    Configuration


    Configuration of Parquet can be done using the setConf method on SQLContext or by running SET key=value commands using SQL.
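    For example, a hedged sketch of both forms with one Parquet option (spark.sql.parquet.binaryAsString is used only to show the mechanism; consult the configuration table in the full guide for the supported keys):

    // Via setConf on a SQLContext:
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

    // Or as a SQL command:
    sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")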


    Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using one of two methods in a SQLContext:

    • jsonFile - loads data from a directory of JSON files where each line of the files is a JSON object.
    • jsonRDD - loads data from an existing RDD where each element of the RDD is a string containing a JSON object.
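    A short Scala sketch of the jsonFile path (people.json is assumed to be the sample file shipped under examples/src/main/resources in the Spark distribution):

    // sqlContext is an existing SQLContext; each line of people.json is a separate JSON object.
    val people = sqlContext.jsonFile("examples/src/main/resources/people.json")
    people.printSchema()
    people.registerTempTable("people")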

    Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using one of two methods in a SQLContext:

      • jsonFile - loads data from a directory of JSON files where each line of the files is a JSON object.
    • jsonRDD - loads data from an existing RDD where each element of the RDD is a string containing a JSON object.

    Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using one of two methods in a SQLContext:

        • jsonFile - loads data from a directory of JSON files where each line of the files is a JSON object.
    • jsonRDD - loads data from an existing RDD where each element of the RDD is a string containing a JSON object.

    # Register this DataFrame as a table.
    people.registerTempTable("people")

    # SQL statements can be run by using the sql methods provided by `sqlContext`.
    teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

    # Alternatively, a DataFrame can be created for a JSON dataset represented by an RDD of JSON strings.

    When working with Hive one must construct a HiveContext, which inherits from SQLContext, and adds support for finding tables in the MetaStore and writing queries using HiveQL. Users who do not have an existing Hive deployment can still create a HiveContext. When not configured by the hive-site.xml, the context automatically creates metastore_db and warehouse in the current directory.
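    A rough Scala sketch of the pattern described above (the table name is illustrative, and kv1.txt is assumed to be the sample key/value file from the Spark examples):

    // sc is an existing SparkContext.
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

    hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    hiveContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

    // Queries are expressed in HiveQL and return DataFrames.
    hiveContext.sql("FROM src SELECT key, value").collect().foreach(println)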


    Call sqlContext.uncacheTable("tableName") to remove the table from memory.


    Configuration of in-memory caching can be done using the setConf method on SQLContext or by running SET key=value commands using SQL.
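    A small sketch combining the caching calls with setConf (the batch-size key is an assumption used only to show the mechanism; see the configuration table in the full guide for the supported keys):

    // Optionally tune the columnar cache, then cache a registered table in memory.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
    sqlContext.cacheTable("tableName")

    // Later, release the memory.
    sqlContext.uncacheTable("tableName")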


    You may also use the beeline script that comes with Hive.


    The Thrift JDBC server also supports sending thrift RPC messages over HTTP transport. Use the following setting to enable HTTP mode as a system property or in the hive-site.xml file in conf/:

    hive.server2.transport.mode - Set this to value: http
    hive.server2.thrift.http.port - HTTP port number to listen on; default is 10001
    hive.server2.http.endpoint - HTTP endpoint; default is cliservice
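    For reference, a hedged sketch of connecting with beeline once HTTP mode is enabled (the host, port, and endpoint below are the defaults named above; the exact URL parameters depend on your Hive version):

    beeline> !connect jdbc:hive2://localhost:10001/default?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice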
     
    Spark 1.3 removes the type aliases that were present in the base sql package for DataType. Users should instead import the classes in org.apache.spark.sql.types.
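    A minimal Scala sketch of the new import location (the schema built here is purely illustrative):

    import org.apache.spark.sql.types._

    // DataType classes such as StructType, StructField and StringType now live in sql.types.
    val schema = StructType(Seq(StructField("name", StringType, nullable = true)))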


    UDF Registration Moved to sqlContext.udf (Java & Scala)

    Functions that are used to register UDFs, either for use in the DataFrame DSL or SQL, have been moved into the udf object in SQLContext.
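    A short sketch of the new registration path (the UDF name and body are illustrative, and the people table is assumed to be registered as in the earlier examples):

    // Register a Scala function as a UDF through sqlContext.udf, then use it from SQL.
    sqlContext.udf.register("strLen", (s: String) => s.length)
    sqlContext.sql("SELECT strLen(name) FROM people")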

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
    For additional commands, e-mail: commits-help@spark.apache.org