spark-issues mailing list archives

From "Serge Smertin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table
Date Thu, 21 Sep 2017 12:31:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174660#comment-16174660
] 

Serge Smertin commented on SPARK-18727:
---------------------------------------

In one of the use cases for this project, described in [#comment-15987668] by [~simeons], new fields are added to nested _struct_ fields. The application is built so that the Parquet files are created and partitioned outside of Spark, and only new columns are ever added, mostly within a couple of nested structs.

I don't know all the potential implications of this idea, but could we simply use the last element of the selected files instead of the first one, given that the FileStatus [list is already sorted by path lexicographically|https://github.com/apache/spark/blob/32fa0b81411f781173e185f4b19b9fd6d118f9fe/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L251]? It is easier to guarantee that only new columns will be added over time. The following code change doesn't seem to be a huge deviation from the current behavior, and would save tremendous time compared to {{spark.sql.parquet.mergeSchema=true}}:

{code:scala}
// ParquetFileFormat.scala (lines 232..240): read the schema from the *last*
// file in path order instead of the first
filesByType.commonMetadata.lastOption
  .orElse(filesByType.metadata.lastOption)
  .orElse(filesByType.data.lastOption)
{code}
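To illustrate why picking the last element would tend to surface the newest schema: a minimal sketch, assuming hypothetical date-partitioned paths whose lexicographic order matches their chronological order (zero-padded partition values) — the bucket name and layout below are made up for illustration only:

```scala
// Hypothetical partitioned file paths; Spark sorts the listed files
// lexicographically by path before choosing which footer to read the schema from.
val files = Seq(
  "s3://bucket/table/date=2017-03-15/part-00000.parquet",
  "s3://bucket/table/date=2017-01-01/part-00000.parquet",
  "s3://bucket/table/date=2017-09-21/part-00000.parquet"
).sorted

// Current behavior reads the schema from the first file (the oldest partition);
// the proposal reads it from the last one, which carries any newly added columns.
println(files.headOption.get) // oldest partition: date=2017-01-01
println(files.lastOption.get) // newest partition: date=2017-09-21
```

Note this only holds when paths sort in the same order as file recency (e.g. zero-padded date partitions); for arbitrary layouts the last path is not necessarily the newest file.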

/cc [~rxin@databricks.com] [~xwu0226] 

> Support schema evolution as new files are inserted into table
> -------------------------------------------------------------
>
>                 Key: SPARK-18727
>                 URL: https://issues.apache.org/jira/browse/SPARK-18727
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Eric Liang
>            Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, one issue
for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, which does
not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, or automatically
as new files with compatible schemas are appended into the table.
> cc [~rxin]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


