spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-6632) Optimize the parquetSchema to metastore schema reconciliation, so that the process is delegated to each map task itself
Date Tue, 31 Mar 2015 13:30:52 GMT

    [ https://issues.apache.org/jira/browse/SPARK-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388520#comment-14388520 ]

Apache Spark commented on SPARK-6632:
-------------------------------------

User 'saucam' has created a pull request for this issue:
https://github.com/apache/spark/pull/5298

> Optimize the parquetSchema to metastore schema reconciliation, so that the process is delegated to each map task itself
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6632
>                 URL: https://issues.apache.org/jira/browse/SPARK-6632
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Yash Datta
>             Fix For: 1.4.0
>
>
> Currently, ParquetRelation2 first merges the schemas from all the part files and then
> reconciles the merged schema with the metastore schema. This approach does not scale when
> the table has thousands of partitions. Instead, we can start from the metastore schema and
> reconcile the column names within each map task, using the ReadSupport hooks provided by
> Parquet.
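
A minimal sketch of the per-task idea described above, written in plain Python rather than against Spark or Parquet. It assumes the only mismatch to reconcile is letter case (metastore column names lowercased, Parquet file column names case-preserved); the function name and the (name, type) tuples are hypothetical and are not the ReadSupport API. Each task would run this against the schema of the single file it reads, so no global merge of all part-file schemas is needed.

```python
from typing import List, Tuple

# A field is (column name, type name). This shape is an assumption for the sketch.
Field = Tuple[str, str]


def reconcile_with_file_schema(
    metastore_fields: List[Field], file_fields: List[Field]
) -> List[Field]:
    """Rename metastore columns to the spelling used in one Parquet file.

    Matching is case-insensitive. Columns absent from the file keep the
    metastore name, so the caller can treat them as missing (null) columns.
    """
    file_name_by_lower = {}
    for name, _type_name in file_fields:
        key = name.lower()
        if key in file_name_by_lower:
            # Two file columns differing only by case cannot be matched safely.
            raise ValueError(f"ambiguous column name in file schema: {name!r}")
        file_name_by_lower[key] = name

    return [
        (file_name_by_lower.get(name.lower(), name), type_name)
        for name, type_name in metastore_fields
    ]


# Hypothetical schemas: the metastore side is lowercase, the file side is not.
metastore = [("userid", "bigint"), ("eventtime", "timestamp"), ("country", "string")]
part_file = [("userId", "bigint"), ("eventTime", "timestamp")]
requested = reconcile_with_file_schema(metastore, part_file)
```

The work per task is proportional to the number of columns in one file, independent of how many partitions the table has.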



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


