drill-issues mailing list archives

From "Roman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5239) Drill text reader reports wrong results when column value starts with '#'
Date Wed, 05 Jul 2017 17:52:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075143#comment-16075143 ]

Roman commented on DRILL-5239:
------------------------------

[~paul-rogers],
Thank you for the great introduction to CSV files!

So let's combine all of the above to modify CSV reading. I want to list all the cases that
I need to handle for this ticket:

*1)* Do I need to create a session/system option that controls whether comments are treated
as data or skipped (the first point from my previous message), or would allowing a "blank"
comment symbol be enough?
*2) File has headers.* In this case I will create a session/system option that skips the
header when comments are treated as data (point *1)*). But as I understand it, a header does
not always start with the comment symbol, so it could be hard to separate such a header from
the data. I think this option should be of byte type (not boolean), so that the customer can
specify manually how many lines to skip from the beginning of the document (this assumes the
customer uses the same header template in all CSV files). If the value is "0" (the default),
we will skip all comment lines at the top.
*3) Read headers.* This seems to duplicate point *2)*. Could you please explain what I need
to add? Maybe you have a different view of points *2)* and *3)*?
*4) Skip blank lines.* Nice find! In this case I will create a separate session/system option
that controls whether blank lines are skipped.
*5) Unix extensions.* In this case I will create a separate session/system option that removes
the "\ #" symbols from the beginning of the line and adds the rest as data.
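
To make points 1), 2), 4) and 5) easier to discuss, here is a minimal Python sketch of the line
filtering they describe. It is only an illustration, not Drill's text reader code: the function
name and all parameter names are hypothetical. It assumes a comment is a line that starts with
the comment symbol, and it reads point 5) as "drop the leading backslash and keep the line as
data", which is just one possible interpretation. The special meaning of "0" from point 2)
(skip leading comment lines) is not modeled; here 0 simply means "skip nothing by count".

```python
from typing import Iterable, Iterator, Optional


def filter_lines(
    lines: Iterable[str],
    comment: Optional[str] = "#",  # None models the "blank" comment symbol (point 1)
    skip_first: int = 0,           # number of lines to drop from the top (point 2)
    skip_blank: bool = False,      # drop empty or whitespace-only lines (point 4)
    unescape: bool = False,        # treat "\#..." as data (point 5, assumed reading)
) -> Iterator[str]:
    """Yield only the lines that would be handed to the field parser."""
    for index, line in enumerate(lines):
        if index < skip_first:
            continue
        if skip_blank and not line.strip():
            continue
        if comment is not None:
            if unescape and line.startswith("\\" + comment):
                # Assumption: remove only the backslash, keep the rest as data.
                yield line[1:]
                continue
            if line.startswith(comment):
                continue
        yield line


# Data set from this ticket: the last row starts with '#'.
data = ["D|32", "8h|234", ";#|3489", "^$*(|308", "#|98"]

rows_default = list(filter_lines(data))                   # '#' is the comment symbol
rows_no_comment = list(filter_lines(data, comment=None))  # comments treated as data
```

With the default comment symbol the sketch drops the "#|98" row, which matches the wrong result
reported in the ticket; with the "blank" comment symbol all five rows are kept.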

Could you please tell me your thoughts? Maybe you have some notes that I should add to
this Jira?

> Drill text reader reports wrong results when column value starts with '#'
> -------------------------------------------------------------------------
>
>                 Key: DRILL-5239
>                 URL: https://issues.apache.org/jira/browse/DRILL-5239
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.10.0
>            Reporter: Rahul Challapalli
>            Assignee: Roman
>            Priority: Blocker
>              Labels: doc-impacting
>
> git.commit.id.abbrev=2af709f
> Data Set :
> {code}
> D|32
> 8h|234
> ;#|3489
> ^$*(|308
> #|98
> {code}
> Wrong Result : (Last row is missing)
> {code}
> select columns[0] as col1, columns[1] as col2 from dfs.`/drill/testdata/wtf2.tbl`;
> +-------+-------+
> | col1  | col2  |
> +-------+-------+
> | D     | 32    |
> | 8h    | 234   |
> | ;#    | 3489  |
> | ^$*(  | 308   |
> +-------+-------+
> 4 rows selected (0.233 seconds)
> {code}
> However, the issue does not happen with a Parquet file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
