spark-issues mailing list archives

From "Shuai Lin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table
Date Sun, 15 Jan 2017 10:51:26 GMT

    [ https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823103#comment-15823103 ]

Shuai Lin commented on SPARK-19153:
-----------------------------------

I find it's quite straightforward to remove the partitioned-by restriction for the {{create table t1 using hive partitioned by (c1, c2) as select ...}} CTAS statement.

But another problem comes up: the partition columns must be the rightmost columns of the schema; otherwise the schema we store in the table properties of the metastore (under the property key "spark.sql.sources.schema") would be inconsistent with the schema we read back through the hive client api.

The reason is that, when creating a hive table in the metastore, the schema and the partition columns are disjoint sets (as required by the hive client api). When we read the table back, we append the partition columns to the end of the schema to get the catalyst schema, i.e.:
{code}
// HiveClientImpl.scala
val partCols = h.getPartCols.asScala.map(fromHiveColumn)
val schema = StructType(h.getCols.asScala.map(fromHiveColumn) ++ partCols)
{code}
It was not a problem before we had the unified "create table" syntax, because the old create hive table syntax forces us to specify the normal columns and the partition columns separately, e.g. {{create table t1 (id int, name string) partitioned by (dept string)}}.

Now that we can create a partitioned table using hive format, e.g. {{create table t1 (id int, name string, dept string) using hive partitioned by (name)}}, the partition columns may not be the last ones in the declared schema, so I think we need to reorder the schema so that the partition columns come last. This is consistent with data source tables, e.g.

{code}
scala> sql("create table t1 (id int, name string, dept string) using parquet partitioned by (name)")
scala> spark.table("t1").schema.fields.map(_.name)
res44: Array[String] = Array(id, dept, name)
{code}
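The reordering itself could be a small helper along these lines (a plain-Scala sketch, no Spark dependency; {{reorderColumns}} is a hypothetical name, not an existing Spark function): move the partition columns to the end of the declared column order, keeping the relative order within each group.
{code}
// Hypothetical sketch: given the user's declared column order and the
// partitioned-by list, move the partition columns to the end so the
// stored catalyst schema matches what HiveClientImpl reconstructs
// (normal columns ++ partition columns).
def reorderColumns(columns: Seq[String], partitionColumns: Seq[String]): Seq[String] = {
  val partSet = partitionColumns.toSet
  // partition() keeps the relative order inside each of the two groups
  val (partCols, normalCols) = columns.partition(partSet.contains)
  normalCols ++ partCols
}

// Matches the data source table behaviour shown above:
// reorderColumns(Seq("id", "name", "dept"), Seq("name")) == Seq("id", "dept", "name")
{code}
In the real implementation this would operate on {{StructField}}s of the {{StructType}} rather than bare column names, but the ordering rule is the same.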

[~cloud_fan] Does this sound good to you?


> DataFrameWriter.saveAsTable should work with hive format to create partitioned table
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-19153
>                 URL: https://issues.apache.org/jira/browse/SPARK-19153
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

