spark-issues mailing list archives

From "Michael Allman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-15705) Spark won't read ORC schema from metastore for partitioned tables
Date Tue, 19 Jul 2016 16:30:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15384464#comment-15384464
] 

Michael Allman commented on SPARK-15705:
----------------------------------------

Yeah, I'm not too keen on that code path either. Inferring the schema from reading every file
in a partitioned table is a relatively heavyweight and slow operation. We're not leveraging
the fact that we have a metastore schema. This is a performance/efficiency issue I've been
working on in our own codebase, and I believe we'll have something to contribute in the near
future. However, that will definitely not make it into 2.0. As for your problem specifically,
I can't really suggest a quick fix because I have no experience with the ORC file format.
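The idea of leveraging the metastore schema instead of inferring one from the files can be illustrated with a small sketch. Since Hive-written ORC files carry synthetic column names (`_col0`, `_col1`, ...) while the metastore holds the real names, one possible reconciliation is purely positional. Everything below is a hypothetical illustration, not Spark's actual internals: `SchemaReconcile` and `reconcile` are made-up names, and real code would operate on `StructType`s rather than name lists.

```scala
// Hedged sketch: reconcile file-derived column names with metastore names
// by position. Synthetic ORC names (_col0, _col1, ...) are replaced with
// the metastore's names; columns that already have real names (e.g. the
// partition column) pass through unchanged. Illustrative only.
object SchemaReconcile {
  def reconcile(fileCols: Seq[String], metastoreCols: Seq[String]): Seq[String] =
    fileCols.zip(metastoreCols).map {
      case (f, m) if f.matches("_col\\d+") => m // synthetic name: use metastore name
      case (f, _)                          => f // real name: keep as-is
    }
}
```

Applied to the schema in this ticket, `reconcile(Seq("_col0", "_col1", "state"), Seq("id", "name", "state"))` would yield `Seq("id", "name", "state")`, i.e. the names the reporter expected. A real fix would also need to validate that the types line up positionally before trusting the metastore names.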

> Spark won't read ORC schema from metastore for partitioned tables
> -----------------------------------------------------------------
>
>                 Key: SPARK-15705
>                 URL: https://issues.apache.org/jira/browse/SPARK-15705
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>         Environment: HDP 2.3.4 (Hive 1.2.1, Hadoop 2.7.1)
>            Reporter: Nic Eggert
>            Priority: Critical
>
> Spark does not seem to read the schema from the Hive metastore for partitioned tables
> stored as ORC files. It appears to read the schema from the files themselves, which, if they
> were created with Hive, does not match the metastore schema (at least not before Hive
> 2.0; see HIVE-4243). To reproduce:
> In Hive:
> {code}
> hive> create table default.test (id BIGINT, name STRING) partitioned by (state STRING) stored as orc;
> hive> insert into table default.test partition (state="CA") values (1, "mike"), (2, "steve"), (3, "bill");
> {code}
> In Spark:
> {code}
> scala> spark.table("default.test").printSchema
> {code}
> Expected result: Spark should preserve the column names that were defined in Hive.
> Actual Result:
> {code}
> root
>  |-- _col0: long (nullable = true)
>  |-- _col1: string (nullable = true)
>  |-- state: string (nullable = true)
> {code}
> Possibly related to SPARK-14959?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

