hive-issues mailing list archives

From "Panagiotis Garefalakis (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-23014) ORC reading performance
Date Thu, 12 Mar 2020 15:13:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-23014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058005#comment-17058005 ]

Panagiotis Garefalakis edited comment on HIVE-23014 at 3/12/20, 3:12 PM:
-------------------------------------------------------------------------

Thanks for the extra details, [~petertoth].
 I have a feeling that the included-columns option is not properly set on the OrcReader, so
it ends up reading the whole dataset.
 For instance, with 200 columns the runtime is 2x that of reading 100 columns, and in a similar
manner with 300 columns it is 3x (while it should read just 1 column each time).
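
To make the expectation above concrete, here is a minimal sketch of the include mask I would expect the reader to receive when a single column is selected. It assumes a flat schema, where index 0 of the mask is the root struct and top-level field i sits at index i + 1. The class and method names are hypothetical and are not part of Hive or ORC. As far as I understand, a mask of this shape is what {{Reader.Options.include(boolean[])}} takes, and a missing (null) mask means every column is read, which would match runtime growing with the column count.

```java
// Hypothetical helper for illustration only; not Hive or ORC code.
final class IncludedColumns {
    private IncludedColumns() {}

    /**
     * Builds an include mask for a flat schema with the given number of
     * top-level fields. Assumption: index 0 is the root struct and
     * field i (0-based) maps to index i + 1.
     */
    static boolean[] forFlatSchema(int topLevelColumns, int... selectedFields) {
        boolean[] include = new boolean[topLevelColumns + 1];
        include[0] = true; // the root struct is always included
        for (int field : selectedFields) {
            if (field < 0 || field >= topLevelColumns) {
                throw new IllegalArgumentException("No such field: " + field);
            }
            include[field + 1] = true;
        }
        return include;
    }

    /** Counts the selected fields, not counting the root struct at index 0. */
    static int selectedCount(boolean[] include) {
        int count = 0;
        for (int i = 1; i < include.length; i++) {
            if (include[i]) {
                count++;
            }
        }
        return count;
    }
}
```

For a 300-column table with only the first field projected, {{IncludedColumns.forFlatSchema(300, 0)}} gives a mask of length 301 with only indexes 0 and 1 set, so the amount of column data read should not depend on the table width.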

I can also see that there are some major changes in the getIncludedColumns method in 2.3.6 that
could be the cause: [https://github.com/apache/hive/blob/2c2fdd524e8783f6e1f3ef15281cc2d5ed08728f/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L396]

cc: [~gopalv] [~ashutoshc] [~omalley]


> ORC reading performance
> -----------------------
>
>                 Key: HIVE-23014
>                 URL: https://issues.apache.org/jira/browse/HIVE-23014
>             Project: Hive
>          Issue Type: Bug
>          Components: ORC
>    Affects Versions: 2.3.6
>            Reporter: Peter Toth
>            Priority: Major
>         Attachments: OrcReadBenchmark-results.txt.hive-1.2.1, OrcReadBenchmark-results.txt.hive-2.3.6
>
>
> Spark 3 adds support for using Hive 2.3.6 in addition to the old Hive 1.2.1 version. Some of
the ORC reading benchmarks show that there is a huge performance difference in ORC reading
between the 2 versions. I measured that {{org.apache.hadoop.hive.ql.io.orc.ReaderImpl}} in
hive-exec-2.3.6-core.jar is ~3-5 times slower than in hive-exec-1.2.1.spark2.jar.
> I'm not sure if more recent Hive versions still suffer from this performance regression.
> Please see some details here: SPARK-30565



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
