spark-issues mailing list archives

From "Chenxiao Mao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-25175) Field resolution should fail if there is ambiguity for ORC data source native implementation
Date Tue, 28 Aug 2018 16:30:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595240#comment-16595240 ]

Chenxiao Mao commented on SPARK-25175:
--------------------------------------

[~cloud_fan] Does it make sense?

> Field resolution should fail if there is ambiguity for ORC data source native implementation
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25175
>                 URL: https://issues.apache.org/jira/browse/SPARK-25175
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Chenxiao Mao
>            Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading from Parquet files. We found that ORC files have similar, but not identical, issues. Spark has two OrcFileFormat implementations:
>  * Since SPARK-2883, Spark has supported ORC inside the sql/hive module, with a Hive dependency. This hive OrcFileFormat always does case-insensitive field resolution, regardless of the case sensitivity mode. When there is ambiguity, the hive OrcFileFormat always returns the first matched field rather than failing the read operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native OrcFileFormat supports case-insensitive field resolution, but it cannot handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If the ORC data file has more fields than the table schema, we simply cannot read hive serde tables. If the ORC data file does not have more fields, hive serde tables always do field resolution by ordinal rather than by name.
> Both the ORC data source hive impl and hive serde tables rely on the hive orc InputFormat/SerDe to read the table. I'm not sure whether we can change the underlying hive classes to make all ORC read behaviors consistent.
> This ticket aims to make the read behavior of the ORC data source native impl consistent with the Parquet data source.
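
To make the difference between the two resolution behaviors concrete, here is a minimal sketch in plain Python. It is not Spark's code: the function names `resolve_field` and `resolve_first_match` are hypothetical, and the field lists stand in for the column names stored in an ORC file. The first function fails on ambiguity, as this ticket proposes; the second mimics the "return the first matched field" behavior described for the hive OrcFileFormat.

```python
from typing import Optional, Sequence


def resolve_field(requested: str, file_fields: Sequence[str],
                  case_sensitive: bool) -> Optional[int]:
    """Return the index of the file field matching `requested`, or None.

    Hypothetical helper: in case-insensitive mode it raises if more than
    one file field matches, instead of silently picking one.
    """
    if case_sensitive:
        # Exact-name lookup; names differing only by case are distinct.
        for i, name in enumerate(file_fields):
            if name == requested:
                return i
        return None

    wanted = requested.lower()
    matches = [i for i, name in enumerate(file_fields) if name.lower() == wanted]
    if len(matches) > 1:
        found = ", ".join(file_fields[i] for i in matches)
        raise ValueError(
            f'Found duplicate field(s) "{requested}": [{found}] '
            "in case-insensitive mode"
        )
    return matches[0] if matches else None


def resolve_first_match(requested: str,
                        file_fields: Sequence[str]) -> Optional[int]:
    """Always case-insensitive; returns the first match, even if ambiguous."""
    wanted = requested.lower()
    for i, name in enumerate(file_fields):
        if name.lower() == wanted:
            return i
    return None
```

With file fields `["A", "a"]`, `resolve_first_match("a", ...)` returns index 0 without complaint, while `resolve_field("a", ..., case_sensitive=False)` raises. In case-sensitive mode, `resolve_field` picks index 1, the exact match.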



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

