drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5941) Skip header / footer logic works incorrectly for Hive tables when file has several input splits
Date Tue, 14 Nov 2017 17:39:11 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251801#comment-16251801

ASF GitHub Bot commented on DRILL-5941:

Github user arina-ielchiieva commented on the issue:

    To create a reader for each input split and maintain the skip header / footer functionality
we need to know how many rows are in each input split. Unfortunately, an input split does not hold
such information, only the number of bytes. [1] We also can't apply the skip header logic to the
first input split and the skip footer logic to the last one, since we don't know in advance how many
rows will be skipped: it may happen that we need to skip the whole first input split
and part of the second.
    To read from Hive we actually use the Hadoop reader [2, 3], so if I am not mistaken the
approach described above unfortunately cannot be applied.
    [1] https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/FileSplit.html
    [2] https://github.com/apache/drill/blob/master/contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveAbstractReader.java#L234
    [3] https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/RecordReader.html
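To illustrate the point about [1]: a byte-range split in the style of Hadoop's FileSplit carries only a start offset and a length, so the number of rows in each split can only be discovered by scanning the bytes themselves. The following is a minimal, self-contained Java sketch (class and method names are invented for the example; this is not Drill or Hadoop code):

```java
import java.nio.charset.StandardCharsets;

public class SplitRowCountDemo {
    // Count newline-terminated rows whose terminator lies in [start, start + length).
    // This is the scan an input split does NOT do for us: it only knows bytes.
    public static int rowsInRange(byte[] data, int start, int length) {
        int rows = 0;
        for (int i = start; i < start + length; i++) {
            if (data[i] == '\n') rows++;
        }
        return rows;
    }

    public static void main(String[] args) {
        // Header plus five data rows of deliberately different byte lengths.
        byte[] data = "key,value\n1,a\n2,bb\n3,ccc\n4,dddd\n5,e\n"
                .getBytes(StandardCharsets.UTF_8);
        int mid = data.length / 2; // split purely by bytes, as FileSplit does
        int first = rowsInRange(data, 0, mid);
        int second = rowsInRange(data, mid, data.length - mid);
        // Row counts per split are unknowable from the byte boundary alone.
        System.out.println(first + " rows in split 1, " + second + " in split 2");
    }
}
```

Because row lengths vary, the byte midpoint says nothing about how the rows divide between the two splits; only the full scan reveals it.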

> Skip header / footer logic works incorrectly for Hive tables when file has several input splits
> -----------------------------------------------------------------------------------------------
>                 Key: DRILL-5941
>                 URL: https://issues.apache.org/jira/browse/DRILL-5941
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Hive
>    Affects Versions: 1.11.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>             Fix For: Future
> *To reproduce*
> 1. Create a csv file with two columns (key, value) and 3000029 rows, where the first row
is a header.
> The data file size should be greater than the chunk size of 256 MB. Copy the file to the
distributed file system.
> 2. Create table in Hive:
> {noformat}
> CREATE TABLE h_table(
>   `key` bigint,
>   `value` string)
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'maprfs:/tmp/h_table'
> TBLPROPERTIES (
>  'skip.header.line.count'='1');
> {noformat}
> 3. Execute the query {{select * from hive.h_table}} in Drill (query the data using the Hive
plugin). The result returns fewer rows than expected. The expected result is 3000028 (total
count minus one header row).
> *The root cause*
> Since the file is greater than the default chunk size, it is split into several fragments,
known as input splits. For example:
> {noformat}
> maprfs:/tmp/h_table/h_table.csv:0+268435456
> maprfs:/tmp/h_table/h_table.csv:268435457+492782112
> {noformat}
> TextHiveReader is responsible for handling the skip header and / or footer logic.
> Currently Drill creates a reader [for each input split|https://github.com/apache/drill/blob/master/contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveScanBatchCreator.java#L84]
and the skip header and / or footer logic is applied to each input split, though ideally the
above-mentioned input splits should be read by one reader, so that the skip header / footer
logic is applied correctly.
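The effect of the root cause can be sketched in a few lines of self-contained Java (names are invented for illustration; this is a simulation of the behavior, not Drill code). Each split-reader independently drops the configured number of header rows from its own split, so every split after the first loses a real data row:

```java
public class PerSplitSkipBug {
    // Total rows returned when every split-reader drops `skip` leading rows,
    // mimicking skip.header.line.count being applied per input split.
    // Each split is {firstRowIndex, endRowIndex} in the whole file.
    public static int rowsReturned(int[][] splits, int skip) {
        int count = 0;
        for (int[] s : splits) {
            count += Math.max(0, (s[1] - s[0]) - skip);
        }
        return count;
    }

    public static void main(String[] args) {
        int totalRows = 6;                   // 1 header row + 5 data rows
        int[][] splits = {{0, 3}, {3, 6}};   // file cut into two input splits
        int buggy = rowsReturned(splits, 1); // both readers skip a row: 2 + 2 = 4
        int correct = totalRows - 1;         // only the real header should go: 5
        System.out.println("returned " + buggy + ", expected " + correct);
    }
}
```

With a single reader over both splits (one split {0, 6}), the same per-reader skip yields the correct 5 rows, which matches the fix direction described in the quoted issue.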

This message was sent by Atlassian JIRA
