spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-6538) Add missing nullable Metastore fields when merging a Parquet schema
Date Thu, 26 Mar 2015 17:16:53 GMT

    [ https://issues.apache.org/jira/browse/SPARK-6538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382257#comment-14382257 ]

Apache Spark commented on SPARK-6538:
-------------------------------------

User 'budde' has created a pull request for this issue:
https://github.com/apache/spark/pull/5214

> Add missing nullable Metastore fields when merging a Parquet schema
> -------------------------------------------------------------------
>
>                 Key: SPARK-6538
>                 URL: https://issues.apache.org/jira/browse/SPARK-6538
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Adam Budde
>
> When Spark SQL infers a schema for a DataFrame, it takes the union of all field types present in the structured source data (e.g. an RDD of JSON data). When the source data for a row doesn't define a particular field in the DataFrame's schema, a null value is simply assumed for that field. This workflow makes it very easy to construct tables and query over a set of structured data with a nonuniform schema. However, this behavior is inconsistent in some cases when dealing with Parquet files and an external table managed by a Hive metastore.
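> As an illustration, here is a minimal sketch of that union-schema behavior against the Spark 1.3 API (the data here is hypothetical):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
    val json = sc.parallelize(Seq(
      """{"a": 1, "b": "x"}""",           // this row defines fields a and b
      """{"a": 2, "c": true}"""           // this row defines fields a and c
    ))
    val df = sqlContext.jsonRDD(json)
    df.printSchema()  // inferred union schema: a, b, c -- all nullable
    df.show()         // fields missing from a given row come back as null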
> In our particular use case, we use Spark Streaming to parse and transform our input data and then apply a window function to save an arbitrary-sized batch of data as a Parquet file, which is in turn added as a partition to an external Hive table via an "ALTER TABLE... ADD PARTITION..." statement. Since our input data is nonuniform, it is expected that not every partition batch will contain every field present in the table's schema as obtained from the Hive metastore. As such, we expect that the schema of some of our Parquet files may not contain the same set of fields present in the full metastore schema.
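> Concretely, each windowed batch is handled along these lines (a simplified sketch; the table name, path, partition column, and the assumption of JSON-string input are all hypothetical):

    import org.apache.spark.streaming.Time

    // windowedStream: a DStream of JSON strings from our parse/transform/window
    // stages; hiveContext: an existing HiveContext bound to the metastore.
    windowedStream.foreachRDD { (rdd, time: Time) =>
      val path = s"s3://bucket/events/batch=${time.milliseconds}"
      // Save this batch as a standalone Parquet file...
      hiveContext.jsonRDD(rdd).saveAsParquetFile(path)
      // ...then register it as a new partition of the external table.
      hiveContext.sql(
        s"ALTER TABLE events ADD PARTITION (batch=${time.milliseconds}) " +
        s"LOCATION '$path'")
    }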
> In such cases, it seems natural that Spark SQL would simply assume null values for any fields missing from a partition's Parquet file, provided these fields are specified as nullable by the metastore schema. This is not the case in the current implementation of ParquetRelation2. The mergeMetastoreParquetSchema() method used to reconcile differences between a Parquet file's schema and a schema retrieved from the Hive metastore will raise an exception if the Parquet file doesn't contain the same set of fields specified by the metastore.
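> To illustrate with simplified, hypothetical schemas:

    import org.apache.spark.sql.types._

    val metastoreSchema = StructType(Seq(
      StructField("a", IntegerType, nullable = true),
      StructField("b", StringType,  nullable = true)))

    // The Parquet file for one partition batch never saw field "b":
    val parquetSchema = StructType(Seq(
      StructField("a", IntegerType, nullable = true)))

    // mergeMetastoreParquetSchema(metastoreSchema, parquetSchema)
    // => throws, since "b" has no counterpart in the Parquet schema,
    //    even though the metastore marks it nullable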
> I propose altering this implementation so that any missing metastore fields marked as nullable are merged into the Parquet file's schema before continuing with the checks present in mergeMetastoreParquetSchema().
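> A rough sketch of the idea (the helper name is hypothetical; see the linked pull request for the actual change):

    import org.apache.spark.sql.types.{StructField, StructType}

    // Append any nullable metastore fields that the Parquet file lacks,
    // so the downstream consistency checks can succeed.
    def mergeMissingNullableFields(
        metastoreSchema: StructType,
        parquetSchema: StructType): StructType = {
      val parquetFieldNames = parquetSchema.fieldNames.toSet
      val missingNullable = metastoreSchema.filter { field =>
        !parquetFieldNames.contains(field.name) && field.nullable
      }
      StructType(parquetSchema.fields ++ missingNullable)
    }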
> I'm classifying this as a bug since it exposes inconsistent behavior, IMHO. If you feel this should be an improvement or new feature instead, please feel free to reclassify this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

