drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5941) Skip header / footer logic works incorrectly for Hive tables when file has several input splits
Date Mon, 13 Nov 2017 19:48:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250120#comment-16250120
] 

ASF GitHub Bot commented on DRILL-5941:
---------------------------------------

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/1030
  
    FWIW, the native CSV reader does the following:
    
    * To read the header, it seeks to offset 0 in the file, regardless of the block being
read, then reads the header, which may be a remote read.
    * The reader then seeks to the start of its block. If this is the first block, it skips
the header, else it searches for the start of the next record.
    
    Since Hive has the same challenges, Hive must have solved this; we have only to research
that existing solution.
    
    One simple solution is:
    
    * If block number is 0, skip the header.
    * If block number is 1 or larger, look for the next record separator.
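
    The two rules above can be sketched as follows. This is only an illustration over an in-memory byte string, not Drill or Hive code: the function name `read_split`, its parameters, and the convention that a record belongs to the split containing its first byte are all assumptions made for the example.

```python
import io


def read_split(data: bytes, start: int, length: int, header_lines: int = 1):
    """Return the records whose first byte lies in [start, start + length).

    Hypothetical helper: `data` stands in for the whole file, and records
    are assumed to be newline-terminated.
    """
    end = start + length
    stream = io.BytesIO(data)
    if start == 0:
        # Block 0: skip the header line(s).
        for _ in range(header_lines):
            stream.readline()
    else:
        # Block 1 or larger: back up one byte and discard everything through
        # the next record separator. If the split starts exactly on a record
        # boundary, only the preceding newline is discarded.
        stream.seek(start - 1)
        stream.readline()
    records = []
    # A record that starts before `end` is read in full, even if it crosses
    # into the next split; the next split discards that partial record.
    while stream.tell() < end:
        line = stream.readline()
        if not line:
            break
        records.append(line.rstrip(b"\n"))
    return records


data = b"key,value\n1,a\n2,b\n3,c\n"
first = read_split(data, 0, 12)    # split boundary falls inside "1,a"
second = read_split(data, 12, 10)
```

    With these assumptions, every data row is returned exactly once across the two splits and the header is skipped only in the first one. Footer handling is not covered by this sketch.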


> Skip header / footer logic works incorrectly for Hive tables when file has several input
splits
> -----------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5941
>                 URL: https://issues.apache.org/jira/browse/DRILL-5941
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Hive
>    Affects Versions: 1.11.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>             Fix For: Future
>
>
> *To reproduce*
> 1. Create a CSV file with two columns (key, value) and 3000029 rows, where the first row is
a header.
> The data file should be larger than the chunk size of 256 MB. Copy the file to the
distributed file system.
> 2. Create table in Hive:
> {noformat}
> CREATE EXTERNAL TABLE `h_table`(
>   `key` bigint,
>   `value` string)
> ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY ','
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'maprfs:/tmp/h_table'
> TBLPROPERTIES (
>  'skip.header.line.count'='1');
> {noformat}
> 3. Execute query {{select * from hive.h_table}} in Drill (query data using Hive plugin).
The result will return fewer rows than expected. The expected result is 3000028 (total count minus
one row for the header).
> *The root cause*
> Since the file is larger than the default chunk size, it is split into several fragments, known
as input splits. For example:
> {noformat}
> maprfs:/tmp/h_table/h_table.csv:0+268435456
> maprfs:/tmp/h_table/h_table.csv:268435457+492782112
> {noformat}
> TextHiveReader is responsible for handling skip header and / or footer logic.
> Currently Drill creates a reader [for each input split|https://github.com/apache/drill/blob/master/contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveScanBatchCreator.java#L84]
and the skip header and / or footer logic is applied to each input split, though ideally the
above-mentioned input splits should be read by one reader, so that the skip header / footer
logic is applied correctly.
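
The effect of the root cause on the row count can be illustrated with a small calculation. This is a sketch under stated assumptions, not Drill code: the function `rows_after_skip` is hypothetical, and the way the 3000029 lines are divided between the two splits is invented for the example (only the total comes from the report).

```python
def rows_after_skip(rows_per_split, header_lines=1, skip_in_every_split=True):
    """Count rows returned when header lines are skipped per split.

    skip_in_every_split=True models the reported behaviour (one reader per
    input split, each skipping the header); False models skipping the header
    only in the first split.
    """
    total = 0
    for index, rows in enumerate(rows_per_split):
        if skip_in_every_split or index == 0:
            rows = max(rows - header_lines, 0)
        total += rows
    return total


# Hypothetical division of the 3000029 lines across two input splits.
rows_per_split = [1_000_000, 2_000_029]

reported = rows_after_skip(rows_per_split, skip_in_every_split=True)
expected = rows_after_skip(rows_per_split, skip_in_every_split=False)
```

Under these assumptions, each split after the first loses `header_lines` data rows, so the shortfall grows with the number of input splits.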



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
