spark-issues mailing list archives

From "Xiao Li (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-16552) Store the Inferred Schemas into External Catalog Tables when Creating Tables
Date Thu, 14 Jul 2016 20:08:20 GMT
Xiao Li created SPARK-16552:
-------------------------------

             Summary: Store the Inferred Schemas into External Catalog Tables when Creating Tables
                 Key: SPARK-16552
                 URL: https://issues.apache.org/jira/browse/SPARK-16552
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Xiao Li


Currently, in Spark SQL, the initial creation of a table's schema can be classified into two groups.
This applies to both Hive tables and Data Source tables:

Group A. Users specify the schema. 

Case 1 CREATE TABLE AS SELECT: the schema is determined by the result schema of the SELECT
clause. For example,
{noformat}
CREATE TABLE tab STORED AS TEXTFILE
AS SELECT * FROM input
{noformat}

Case 2 CREATE TABLE: users explicitly specify the schema. For example,
{noformat}
CREATE TABLE jsonTable (_1 string, _2 string)
USING org.apache.spark.sql.json
{noformat}

Group B. Spark SQL infers the schema at runtime.

Case 3 CREATE TABLE: users do not specify the schema, only the path to the file location. For
example,
{noformat}
CREATE TABLE jsonTable 
USING org.apache.spark.sql.json
OPTIONS (path '${tempDir.getCanonicalPath}')
{noformat}

Currently, Spark SQL does not store the inferred schema in the external catalog for the cases
in Group B. When users refresh the metadata cache, or access the table for the first time after
(re-)starting Spark, Spark SQL infers the schema and stores it in the metadata cache to improve
the performance of subsequent metadata requests. However, this runtime schema inference can
cause undesirable schema changes after each reboot of Spark.
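The drift can be illustrated with a small self-contained Python sketch. This is not Spark code; the `infer_schema` helper and the file contents are invented for illustration, mimicking how runtime inference over a table location picks up whatever columns happen to exist at that moment:

```python
import json

def infer_schema(json_lines):
    """Infer a column list by unioning the keys of every JSON record,
    loosely mimicking Spark SQL's runtime schema inference over a path."""
    cols = set()
    for line in json_lines:
        cols |= set(json.loads(line).keys())
    return sorted(cols)

# Files at the table location when the table is first accessed.
files_v1 = ['{"a": 1, "b": 2}']
schema_at_first_boot = infer_schema(files_v1)

# A writer later adds a file with an extra column; Spark is then restarted
# and re-infers the schema from scratch on first access.
files_v2 = files_v1 + ['{"a": 3, "c": 4}']
schema_after_reboot = infer_schema(files_v2)

print(schema_at_first_boot)  # ['a', 'b']
print(schema_after_reboot)   # ['a', 'b', 'c'] -- changed without user action
```

Because nothing was persisted in the external catalog, the table's schema silently differs between the two "boots".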

It is desirable to store the inferred schema in the external catalog when creating the table.
When users intend to refresh the schema, they issue `REFRESH TABLE`; Spark SQL then infers
the schema again from the previously specified table location and updates the schema in both
the external catalog and the metadata cache.
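The proposed behavior can be sketched in a few lines of Python. The `Catalog` class and the `create_table`/`refresh_table` helpers are hypothetical stand-ins for Spark's external catalog operations, invented here only to show the intended lifecycle: infer once at CREATE TABLE, persist, and re-infer only on an explicit REFRESH TABLE:

```python
import json

class Catalog:
    """Toy external catalog: maps table name -> persisted column list."""
    def __init__(self):
        self.tables = {}

def infer_schema(json_lines):
    """Union the keys of every JSON record, as a stand-in for inference."""
    cols = set()
    for line in json_lines:
        cols |= set(json.loads(line).keys())
    return sorted(cols)

def create_table(catalog, name, files):
    # Proposed: infer the schema once at CREATE TABLE and persist it.
    catalog.tables[name] = infer_schema(files)

def get_schema(catalog, name):
    # Subsequent lookups (including after a restart) read the stored
    # schema; the files at the table location are not re-scanned.
    return catalog.tables[name]

def refresh_table(catalog, name, files):
    # REFRESH TABLE explicitly re-infers from the table location and
    # updates the persisted schema.
    catalog.tables[name] = infer_schema(files)

catalog = Catalog()
files = ['{"a": 1, "b": 2}']
create_table(catalog, "jsonTable", files)

files.append('{"a": 3, "c": 4}')          # new data arrives at the path
print(get_schema(catalog, "jsonTable"))   # ['a', 'b'] -- stable across reboots

refresh_table(catalog, "jsonTable", files)
print(get_schema(catalog, "jsonTable"))   # ['a', 'b', 'c'] -- updated on demand
```

The key design point is that schema changes become explicit and user-driven (via REFRESH TABLE) rather than an implicit side effect of restarting Spark.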




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
