drill-dev mailing list archives

From "Arina Ielchiieva (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5991) Performance improvements for Hive tables with skip header / footer logic
Date Fri, 24 Nov 2017 10:30:00 GMT
Arina Ielchiieva created DRILL-5991:
---------------------------------------

             Summary: Performance improvements for Hive tables with skip header / footer logic
                 Key: DRILL-5991
                 URL: https://issues.apache.org/jira/browse/DRILL-5991
             Project: Apache Drill
          Issue Type: Improvement
          Components: Storage - Hive
    Affects Versions: 1.12.0
            Reporter: Arina Ielchiieva


Currently, when a Hive table has a header / footer, all input splits of the file are processed by
one reader. This hurts performance. A better way would be to keep one reader per split and
find a way to tell each reader how many rows it should skip.

To create a reader for each input split and still maintain the skip header / footer functionality, we need
to know how many rows each input split contains. Unfortunately, an input split does not hold such information,
only the [number of bytes|https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/FileSplit.html].
We also can't simply apply skip header to the first input split and skip footer to the
last one, since we don't know how many of the rows to skip fall into each split: we may
need to skip the whole first input split and part of the second. In addition, we use the [Hadoop
reader|https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/RecordReader.html]
for the data, which does not provide the number of rows in an input split either.
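
To illustrate why byte-based splits say nothing about row counts, here is a minimal, self-contained sketch in plain Java (no Hadoop classes). The class SplitRowCounts, the method rowsPerSplit, the sample file and the 4-byte split size are all hypothetical and chosen only for illustration. A row is attributed to the split that contains its first byte, which is a simplification and not necessarily how the Hadoop reader assigns rows.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SplitRowCounts {

  // Counts rows per fixed-size byte split.
  // Assumption (illustration only): a row belongs to the split
  // that contains its first byte.
  static int[] rowsPerSplit(byte[] data, int splitSize) {
    int splits = (data.length + splitSize - 1) / splitSize;
    int[] counts = new int[splits];
    int rowStart = 0;
    for (int i = 0; i < data.length; i++) {
      if (data[i] == '\n') {
        counts[rowStart / splitSize]++;
        rowStart = i + 1;
      }
    }
    if (rowStart < data.length) {
      // Last row has no trailing newline.
      counts[rowStart / splitSize]++;
    }
    return counts;
  }

  public static void main(String[] args) {
    // Three header rows followed by two data rows of very different length.
    String file = "h1\nh2\nh3\na-long-data-row\nb\n";
    int[] counts = rowsPerSplit(file.getBytes(StandardCharsets.US_ASCII), 4);
    System.out.println(Arrays.toString(counts)); // prints [2, 1, 1, 0, 0, 0, 1]
  }
}
```

With a three-row header, the first split holds only two of the header rows and the third one starts in the second split, so skipping the header means dropping the whole first split and part of the second. Nothing in the split sizes alone reveals this.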

Possible improvements:
1. For tables with a header only: before creating the readers, we can start skipping the header and, when
done, create a reader at that position. For the other, untouched input splits we create separate readers,
though all readers will be on the same node.
2. Consider using the Drill text reader instead of the Hadoop one (as we do for parquet files), which
might provide more flexibility in terms of offsetting bytes etc. This should be investigated
further.
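
Improvement 1 could look roughly like the sketch below. It is plain Java under stated assumptions, not Drill or Hadoop code: HeaderSkipPlanner, ByteRange, fixedSplits, headerEndOffset and plan are hypothetical names, ByteRange merely stands in for the start / length pair of an input split, and the file content is held in memory only to keep the example short. The idea is to find the byte offset just past the header once, then drop or trim the splits that overlap the header and leave the rest untouched.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class HeaderSkipPlanner {

  // Hypothetical stand-in for the start / length of an input split.
  static final class ByteRange {
    final long start;
    final long length;

    ByteRange(long start, long length) {
      this.start = start;
      this.length = length;
    }

    @Override
    public String toString() {
      return "[start=" + start + ", length=" + length + "]";
    }
  }

  // Cuts a file of the given length into fixed-size byte ranges.
  static List<ByteRange> fixedSplits(long fileLength, long splitSize) {
    List<ByteRange> splits = new ArrayList<>();
    for (long s = 0; s < fileLength; s += splitSize) {
      splits.add(new ByteRange(s, Math.min(splitSize, fileLength - s)));
    }
    return splits;
  }

  // Returns the offset of the first byte after the header rows,
  // or data.length if the file has no more than headerLines rows.
  static long headerEndOffset(byte[] data, int headerLines) {
    int seen = 0;
    for (int i = 0; i < data.length; i++) {
      if (seen == headerLines) {
        return i;
      }
      if (data[i] == '\n') {
        seen++;
      }
    }
    return data.length;
  }

  // Drops splits that lie entirely inside the header and trims the one
  // that contains the header end; later splits are kept as they are.
  static List<ByteRange> plan(List<ByteRange> splits, long offset) {
    List<ByteRange> out = new ArrayList<>();
    for (ByteRange s : splits) {
      long end = s.start + s.length;
      if (end <= offset) {
        continue; // whole split is header
      }
      long start = Math.max(s.start, offset);
      out.add(new ByteRange(start, end - start));
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] data = "h1\nh2\nh3\na-long-data-row\nb\n".getBytes(StandardCharsets.US_ASCII);
    long offset = headerEndOffset(data, 3);
    System.out.println(plan(fixedSplits(data.length, 4), offset));
  }
}
```

Each remaining range would get its own reader. The sketch covers the header-only case; footers are not handled, and a real reader would still need its usual handling of rows that cross split boundaries.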



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
