phoenix-issues mailing list archives

From "Ankit Singhal (JIRA)" <>
Subject [jira] [Commented] (PHOENIX-5258) Add support to parse header from the input CSV file as input columns for CsvBulkLoadTool
Date Tue, 07 May 2019 18:44:00 GMT


Ankit Singhal commented on PHOENIX-5258:

A couple of things from my side as well:
* How does the skip-header option generally work in CsvBulkLoadTool? Does it skip the first line
of every input split? Isn't it possible that the same CSV file is split into two input splits and
our InputFormat skips the first line of each split, leaving us one actual row short?
* Phoenix supports case-sensitive columns as well. Is it possible to preserve the case sensitivity
of the header instead of doing:
+	        headerColumns = Lists.newArrayList(Splitter.on(",").trimResults().split(header));
+	        CollectionUtils.transform(headerColumns, new Transformer() {
+	            @Override
+	            public Object transform(Object input) {
+	                return input.toString().toUpperCase();
+	            }
+	        });
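
To make the case-sensitivity point concrete, here is a minimal sketch of one possible alternative, assuming a header field wrapped in double quotes should keep its case while an unquoted field is upper-cased. The class `HeaderColumns` and its methods are hypothetical names for illustration and are not part of the patch; the comma split is as naive as the one in the snippet above and does not handle commas inside quoted names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not taken from the patch under review.
final class HeaderColumns {

    // Assumption: "quoted" names are case-sensitive and keep their case;
    // unquoted names are upper-cased.
    static String normalize(String raw) {
        String name = raw.trim();
        if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
            // Strip the surrounding quotes and preserve the original case.
            return name.substring(1, name.length() - 1);
        }
        return name.toUpperCase(Locale.ROOT);
    }

    // Naive comma split; a real CSV header may need a proper CSV parser.
    static List<String> parseHeader(String header) {
        List<String> columns = new ArrayList<>();
        for (String field : header.split(",", -1)) {
            columns.add(normalize(field));
        }
        return columns;
    }
}
```

With this sketch, a header line such as `id, "firstName" ,age` would yield the columns `ID`, `firstName` and `AGE`.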

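On the first point (header skipping per input split), the concern can be illustrated with a small self-contained simulation. It is only a sketch and does not reproduce Phoenix or Hadoop code: `SplitHeaderSkip`, `readSplit` and `countRows` are hypothetical names, and the split-reading convention (a split that does not start at offset 0 discards its first, possibly partial, line, and a line starting at or before the split end belongs to that split) is an assumption modelled on common line-oriented record readers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of header skipping across input splits.
final class SplitHeaderSkip {

    // Returns the data rows of the split [start, end) of `data`.
    // skipInEverySplit = true models the suspected bug: every split drops its first line.
    // skipInEverySplit = false drops the first line only for the split starting at offset 0.
    static List<String> readSplit(String data, int start, int end, boolean skipInEverySplit) {
        List<String> rows = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Assumed convention: the previous split owns the line that crosses the boundary.
            int nl = data.indexOf('\n', start);
            pos = (nl < 0) ? data.length() : nl + 1;
        }
        boolean firstLine = true;
        while (pos <= end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            int lineEnd = (nl < 0) ? data.length() : nl;
            boolean treatAsHeader = firstLine && (skipInEverySplit || start == 0);
            if (!treatAsHeader) {
                rows.add(data.substring(pos, lineEnd));
            }
            firstLine = false;
            pos = lineEnd + 1;
        }
        return rows;
    }

    // Splits `data` in two at `boundary` and counts the rows read by both splits.
    static int countRows(String data, int boundary, boolean skipInEverySplit) {
        return readSplit(data, 0, boundary, skipInEverySplit).size()
             + readSplit(data, boundary, data.length(), skipInEverySplit).size();
    }
}
```

For a file with one header line and three data rows cut into two splits, the guarded mode reads all three rows, while skipping the first line in every split loses one of them, which is the situation the question above is asking about.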
> Add support to parse header from the input CSV file as input columns for CsvBulkLoadTool
> ----------------------------------------------------------------------------------------
>                 Key: PHOENIX-5258
>                 URL:
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Prashant Vithani
>            Priority: Minor
>             Fix For: 4.15.0, 5.1.0
>         Attachments: PHOENIX-5258-4.x-HBase-1.4.patch, PHOENIX-5258-master.patch
>          Time Spent: 40m
>  Remaining Estimate: 0h
> Currently, CsvBulkLoadTool does not support reading header from the input csv and expects
> the content of the csv to match with the table schema. The support for the header can be added
> to dynamically map the schema with the header.
> The proposed solution is to introduce another option for the tool `–parse-header`.
> If this option is passed, the input columns list is constructed by reading the first line
> of the input CSV file.
>  * If there is only one file, read the header from the first line and generate the `ColumnInfo`
>  * If there are multiple files, read the header from all the files, and throw an error
> if the headers across files do not match.
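
The multi-file rule in the description above could be sketched roughly as follows. This is an illustration under stated assumptions, not the patch's implementation: `HeaderCheck` and `commonHeader` are hypothetical names, the first line of each file is assumed to have been read already and is passed in as a map from file name to header line, and headers are compared after a naive comma split with surrounding whitespace ignored.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating the "headers must match across files" rule.
final class HeaderCheck {

    // Returns the shared header columns, or throws if any file disagrees with the first one.
    static List<String> commonHeader(Map<String, String> firstLineByFile) {
        List<String> expected = null;
        String expectedFile = null;
        for (Map.Entry<String, String> entry : firstLineByFile.entrySet()) {
            // Naive split: commas inside quoted names are not handled.
            List<String> header = Arrays.asList(entry.getValue().trim().split("\\s*,\\s*"));
            if (expected == null) {
                expected = header;
                expectedFile = entry.getKey();
            } else if (!expected.equals(header)) {
                throw new IllegalArgumentException("Header of " + entry.getKey()
                        + " " + header + " does not match header of " + expectedFile
                        + " " + expected);
            }
        }
        if (expected == null) {
            throw new IllegalArgumentException("No input files given");
        }
        return expected;
    }
}
```

Failing fast on the first mismatch keeps the error message tied to a specific pair of files, which seems easier to act on than a generic "headers differ" error.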

This message was sent by Atlassian JIRA