phoenix-issues mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-5258) Add support to parse header from the input CSV file as input columns for CsvBulkLoadTool
Date Mon, 03 Jun 2019 19:42:00 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-5258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16854984#comment-16854984
] 

Josh Elser commented on PHOENIX-5258:
-------------------------------------

{quote}How does the skip-header option generally work in CsvBulkLoadTool? Does it skip the first
line of every input split? Isn't it possible that the same CSV file is split into two InputSplits
and our InputFormat skips the first line of each split, resulting in one actual row fewer?
{quote}
Good question! The custom RecordReader handles this by unwrapping the InputSplit and consuming
the first record only when the split starts at the beginning of a file: [https://github.com/apache/phoenix/blob/20bc74145762d2b19e80b609bec901489accd5cb/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixTextInputFormat.java#L60-L70]
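
To illustrate the difference, here is a minimal plain-Java model of that check. It is a sketch, not the Phoenix code: it uses no Hadoop classes, the names `HeaderSkipSketch`, `readSplit` and `onlyAtFileStart` are made up for this example, and it assumes the usual line-reader convention that a split starting mid-file discards its first line while the previous split reads that line in full.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class HeaderSkipSketch {

    /** Index just past the next '\n' at or after {@code from}, or the end of the data. */
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    /** Records owned by the byte range [start, start + length) of the file. */
    static List<String> readSplit(byte[] data, int start, int length, boolean onlyAtFileStart) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // A split that starts mid-file discards its first (possibly partial) line;
            // the previous split reads that line in full.
            pos = nextLineStart(data, pos);
        }
        List<String> records = new ArrayList<>();
        while (pos < data.length && pos <= end) {
            int next = nextLineStart(data, pos);
            int len = next - pos;
            if (data[next - 1] == '\n') {
                len--; // drop the line terminator
            }
            records.add(new String(data, pos, len, StandardCharsets.UTF_8));
            pos = next;
        }
        // Header skipping: either only for the split at offset 0, or (naively) for every split.
        boolean skip = onlyAtFileStart ? start == 0 : true;
        if (skip && !records.isEmpty()) {
            records.remove(0);
        }
        return records;
    }

    /** Cuts the file into fixed-size splits and concatenates what each split reads. */
    static List<String> readAll(byte[] data, int splitSize, boolean onlyAtFileStart) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            int length = Math.min(splitSize, data.length - start);
            all.addAll(readSplit(data, start, length, onlyAtFileStart));
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] csv = "id,name\n1,a\n2,b\n3,c\n".getBytes(StandardCharsets.UTF_8);
        System.out.println("skip only at offset 0: " + readAll(csv, 10, true));
        System.out.println("skip in every split:   " + readAll(csv, 10, false));
    }
}
```

With the offset check, every data row survives no matter where the file is cut. Skipping in every split is the failure mode the question describes: each extra split silently drops a real row.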

There isn't a safe way to parse the header unless you do one of the following:
 # You unwrap the InputSplit, rewind back to the head of the file and read the first line
in the file (despite the InputSplit telling you not to do that).
 # You read the first line from all input CSV files and cache them in the job configuration.
 # You figure out a way to disallow splitting of the files at the InputFormat level (prevent
a split from ever happening when this option is enabled).
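
As a rough illustration of option 1, the sketch below models a task whose split starts mid-file and which opens the file a second time from byte 0 to fetch the header. It is plain Java on a local temp file, not Hadoop or Phoenix code. `HeaderRewindSketch` and `readHeader` are hypothetical names, the split offset is hard-coded for the example, and the comma split ignores quoted fields.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class HeaderRewindSketch {

    /** Reads the header from byte 0 of the file, whatever offset the caller's split starts at. */
    static List<String> readHeader(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String first = reader.readLine();
            if (first == null) {
                throw new IOException("No header line in " + file);
            }
            // Naive split: a real CSV parser would be needed for quoted fields.
            return Arrays.asList(first.split(",", -1));
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("rewind", ".csv");
        try {
            Files.write(file, "id,name\n1,a\n2,b\n".getBytes(StandardCharsets.UTF_8));
            long splitStart = 12; // assumed offset: this task's split begins at the last row
            List<String> header = readHeader(file); // separate read from the head of the file
            try (RandomAccessFile data = new RandomAccessFile(file.toFile(), "r")) {
                data.seek(splitStart);
                System.out.println(header + " -> " + data.readLine());
            }
        } finally {
            Files.delete(file);
        }
    }
}
```

The cost of this approach is one extra open and a short read per split, in exchange for not carrying any header state in the job configuration.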

I don't like option number 2. There may be issues with option number 1, but in theory it should
work.

> Add support to parse header from the input CSV file as input columns for CsvBulkLoadTool
> ----------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-5258
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5258
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Prashant Vithani
>            Assignee: Prashant Vithani
>            Priority: Minor
>             Fix For: 4.15.0, 5.1.0
>
>         Attachments: PHOENIX-5258-4.x-HBase-1.4.001.patch, PHOENIX-5258-4.x-HBase-1.4.patch,
PHOENIX-5258-master.001.patch, PHOENIX-5258-master.patch
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, CsvBulkLoadTool does not support reading a header from the input CSV and expects
the content of the CSV to match the table schema. Header support can be added
to map the header to the schema dynamically.
> The proposed solution is to introduce another option for the tool, `--parse-header`.
If this option is passed, the input column list is constructed by reading the first line
of the input CSV file.
>  * If there is only one file, read the header from the first line and generate the `ColumnInfo`
list.
>  * If there are multiple files, read the header from all the files, and throw an error
if the headers across files do not match.
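
The multi-file rule in the description above could be checked along these lines. This is only a sketch under assumptions: `HeaderAgreementSketch` and `commonHeader` are hypothetical names, the headers are assumed to have been read and split into column names already, and "match" is taken to mean the same names in the same order.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HeaderAgreementSketch {

    /** Returns the header shared by all files, or throws if any file disagrees with the first. */
    static List<String> commonHeader(Map<String, List<String>> headersByFile) {
        String firstFile = null;
        List<String> expected = null;
        for (Map.Entry<String, List<String>> entry : headersByFile.entrySet()) {
            if (expected == null) {
                firstFile = entry.getKey();
                expected = entry.getValue();
            } else if (!expected.equals(entry.getValue())) {
                throw new IllegalArgumentException("Header of " + entry.getKey() + " "
                        + entry.getValue() + " does not match header of " + firstFile + " " + expected);
            }
        }
        if (expected == null) {
            throw new IllegalArgumentException("No input files");
        }
        return expected;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the first file defines the expected header.
        Map<String, List<String>> headers = new LinkedHashMap<>();
        headers.put("part-1.csv", Arrays.asList("ID", "NAME"));
        headers.put("part-2.csv", Arrays.asList("ID", "NAME"));
        System.out.println(commonHeader(headers));
    }
}
```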



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
