spark-issues mailing list archives

From "Chenxiao Mao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC
Date Mon, 27 Aug 2018 05:23:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593194#comment-16593194 ]

Chenxiao Mao commented on SPARK-25175:
--------------------------------------

[~dongjoon] [~yucai] Here is a brief summary (a reproduction sketch follows the list). We can see that:
 * The data source tables always return a,B,c, regardless of whether spark.sql.caseSensitive is set to true or false, and regardless of whether the metastore table schema is in lower case or upper case. Since the ORC file schema is (a,B,c,C):
 ** Is it better to return null in scenarios 2 and 10?
 ** Is it better to return C in scenario 12?
 ** Is it better to fail due to ambiguity in scenarios 15, 18, 21, and 24, rather than always returning the lower-case one?
 * The Hive SerDe tables always throw an IndexOutOfBoundsException at runtime.
 * Since, in case-sensitive mode, analysis should fail if a column name in the query and the metastore schema differ in case, all the AnalysisException(s) meet our expectations.

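For reference, a minimal reproduction sketch of the setup above, assuming a Hive-enabled spark-shell (the path and table names are illustrative, not from the original tests):

{code:scala}
// Allow columns that differ only in case, so the ORC file schema is (a, B, c, C).
spark.conf.set("spark.sql.caseSensitive", true)

val path = "/tmp/spark_25175_orc"
spark.range(5)
  .selectExpr("id AS a", "id * 10 AS B", "id * 100 AS c", "id * 1000 AS C")
  .write.mode("overwrite").orc(path)

// Data source table over the files: returns a, B, c as summarized above.
sql(s"CREATE TABLE orc_ds (a BIGINT, b BIGINT, c BIGINT) USING orc LOCATION '$path'")
sql("SELECT a, b, c FROM orc_ds").show()

// Hive SerDe table over the same files: fails at runtime with the
// IndexOutOfBoundsException shown below.
sql(s"CREATE EXTERNAL TABLE orc_hive (a BIGINT, b BIGINT, c BIGINT) STORED AS ORC LOCATION '$path'")
sql("SELECT a, b, c FROM orc_hive").show()
{code}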
Stack trace of the IndexOutOfBoundsException:
{code:java}
java.lang.IndexOutOfBoundsException: toIndex = 4
	at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
	at java.util.ArrayList.subList(ArrayList.java:996)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
	at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
	at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
	at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
	at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
{code}
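The {{toIndex = 4}} hints at the cause: the ORC file physically has 4 columns (a,B,c,C) while the Hive table schema declares only 3, and the Hive reader ends up taking a sub-list of the 3 declared schema types using the file's column count. A minimal sketch of that failing pattern (illustrative only; not Hive's actual code):

{code:scala}
import java.util.Arrays

// The Hive table declares 3 column types, but the ORC file has 4 columns.
val declaredTypes = Arrays.asList("bigint", "bigint", "bigint") // a, b, c
val fileColumnCount = 4                                         // a, B, c, C

// subList(0, 4) on a 3-element list fails the range check with
// java.lang.IndexOutOfBoundsException: toIndex = 4
declaredTypes.subList(0, fileColumnCount)
{code}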
 

> Case-insensitive field resolution when reading from ORC
> -------------------------------------------------------
>
>                 Key: SPARK-25175
>                 URL: https://issues.apache.org/jira/browse/SPARK-25175
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Chenxiao Mao
>            Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading from Parquet
> files. We found that ORC files have similar issues. Since Spark has two OrcFileFormat
> implementations, we should add support to both:
> * Since SPARK-2883, Spark has supported ORC inside the sql/hive module, with a Hive
> dependency. This Hive OrcFileFormat doesn't support case-insensitive field resolution at all.
> * SPARK-20682 adds a new ORC data source inside sql/core. This native OrcFileFormat supports
> case-insensitive field resolution; however, it cannot handle duplicate fields.


