spark-issues mailing list archives

From "Tejas Patil (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-16628) OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files
Date Wed, 20 Jul 2016 02:02:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-16628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385184#comment-15385184 ]

Tejas Patil commented on SPARK-16628:
-------------------------------------

Thanks for notifying [~yhuai]. Is this specific to ORC only? As I recall, the change I made
followed the same code path that Parquet uses (and there was a spark.sql.hive.convertMetastoreParquet
flag as well).

> OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16628
>                 URL: https://issues.apache.org/jira/browse/SPARK-16628
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>
> When {{spark.sql.hive.convertMetastoreOrc}} is enabled, we convert an ORC table represented
by a MetastoreRelation to a HadoopFsRelation that internally uses Spark's OrcFileFormat. This
conversion aims to speed up table scans, since the code path that scans a HadoopFsRelation
performs better at runtime. However, OrcFileFormat assumes that ORC files store their schema
with correct column names, whereas before Hive 2.0 an ORC table created by Hive does not store
column names correctly in the ORC files (HIVE-4243). So, for this kind of ORC dataset, we
cannot really convert the code path.
> Right now, if an ORC table was created by Hive 1.x or 0.x, enabling {{spark.sql.hive.convertMetastoreOrc}}
will cause a runtime exception for non-partitioned ORC tables and drop the metastore schema
for partitioned ORC tables.
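One possible mitigation sketch for affected tables (my suggestion while reading the description above, not something the issue itself states) is to turn the conversion off, so Spark keeps the MetastoreRelation (Hive SerDe) read path instead of OrcFileFormat:

```sql
-- Assumed workaround, not confirmed by the issue: disable the metastore ORC
-- conversion so pre-Hive-2.0 ORC tables are read through the Hive SerDe path.
SET spark.sql.hive.convertMetastoreOrc=false;
```

This trades the faster HadoopFsRelation scan path for correctness on tables hit by HIVE-4243.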



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

