hive-dev mailing list archives

From "Ferenc Erdelyi (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-13613) Add computeSplitSize() to CombineHiveInputFormat and HiveInputFormat
Date Tue, 26 Apr 2016 14:09:13 GMT
Ferenc Erdelyi created HIVE-13613:
-------------------------------------

             Summary: Add computeSplitSize() to CombineHiveInputFormat and HiveInputFormat
                 Key: HIVE-13613
                 URL: https://issues.apache.org/jira/browse/HIVE-13613
             Project: Hive
          Issue Type: Improvement
          Components: Hive
    Affects Versions: 1.1.0
            Reporter: Ferenc Erdelyi


The input formats that Hive uses (CombineHiveInputFormat and HiveInputFormat) do not use
computeSplitSize().
Neither class extends FileInputFormat, so that functionality is not available to them.
computeSplitSize() could be used to tune the processing of Parquet files.

Please add computeSplitSize() functionality to CombineHiveInputFormat and HiveInputFormat.

Use case:
We would like our Hive queries to select the right split size (and consequently the
number of mappers) automatically, based on the block size of the data, because this gives
us significant performance gains (e.g. when processing Parquet files). The behaviour I
would expect is that of computeSplitSize() in FileInputFormat:
https://github.com/cloudera/hadoop-common/blob/cdh5-2.6.0_5.5.2/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java
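
For reference, the split-size selection in Hadoop's FileInputFormat comes down to clamping
the block size between a minimum and a maximum split size. The standalone sketch below
reproduces only that formula, with no Hadoop dependency. The class name SplitSizeSketch,
the estimateSplits() helper, and the sample sizes are assumptions made for illustration;
they are not part of Hive or Hadoop, and this is not a proposed patch.

```java
public class SplitSizeSketch {

    // Same clamping formula as FileInputFormat.computeSplitSize():
    // the block size, bounded below by minSize and above by maxSize.
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Hypothetical helper: rough split count for one file (ceiling division).
    // The real getSplits() also applies a slop factor to the last split,
    // so treat this as an estimate only.
    public static long estimateSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 256 * mb;   // assumed block size of the Parquet files
        long fileLength = 1024 * mb; // assumed size of a single input file

        // With the widest possible bounds, the split size follows the block size.
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);

        System.out.println("split size (MB): " + splitSize / mb);
        System.out.println("estimated splits: " + estimateSplits(fileLength, splitSize));
    }
}
```

With the default bounds the split size tracks the block size of each file, which is the
auto-selection described in the use case. Narrowing minSize or maxSize overrides it.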



