drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5239) Drill text reader reports wrong results when column value starts with '#'
Date Tue, 04 Jul 2017 20:11:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074056#comment-16074056 ]

Paul Rogers commented on DRILL-5239:

See [RFC-4180|https://tools.ietf.org/html/rfc4180] for the IETF standard for CSV files. This
standard does not define comments, though it does allow an optional header line. Drill's
"compliant" text reader complies with RFC-4180.

As noted in [Wikipedia|https://en.wikipedia.org/wiki/Comma-separated_values], despite RFC-4180,
CSV is not a well-defined format; it is instead often an informal theme with each system handling
details slightly differently. Indeed, Drill violates RFC-4180 in that we allow Linux-style
line terminators instead of the DOS-style terminators specified in the RFC.

Also came across [Super CSV|https://super-csv.github.io/super-csv/csv_specification.html]
which has a nice summary of the rules in a more readable format than the RFC.

Someone has specified a [JSON format|http://specs.frictionlessdata.io/csv-dialect/] to describe
CSV file formats.

So, we can see that, in normal practice, comments are not a part of the RFC-4180 format. Your
suggestion to have them turned off by default is therefore a good one.
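
To make the default-off argument concrete, here is a minimal Python sketch (not Drill code) that runs the data set from this issue through a toy reader. The function name {{read_rows}} and its {{comment}} parameter are hypothetical, made up for illustration. The only assumption is that a comment line is one whose first character is the comment character.

```python
import csv

# Data set copied from the issue description (pipe-delimited).
DATA = "D|32\n8h|234\n;#|3489\n^$*(|308\n#|98\n"

def read_rows(text, delimiter="|", comment=None):
    """Parse delimited text; drop lines that start with `comment` when it is set."""
    lines = text.splitlines()
    if comment is not None:
        lines = [ln for ln in lines if not ln.startswith(comment)]
    return list(csv.reader(lines, delimiter=delimiter))

with_comments = read_rows(DATA, comment="#")   # 4 rows: the "#|98" record is dropped
without_comments = read_rows(DATA)             # 5 rows: "#|98" is read as data
print(len(with_comments), len(without_comments))
```

With the comment character enabled, the last record silently disappears, which matches the wrong result reported in the issue. With comments off, all five records survive.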

Let's dig into comments a bit. Here is our [first hint|https://stackoverflow.com/questions/1961006/can-a-csv-file-have-a-comment],
third answer down:

In engineering data, it is common to see the # symbol in the first column used to signal a comment.

I use the [ostermiller CSV parsing library|http://ostermiller.org/utils/CSV.html] to read
and process such files. That library allows you to set the comment character. After the parse
operation you get an array just containing the real data, no comments.

The [Ostermiller Java Utilities, Comma Separated Values (CSV)|http://ostermiller.org/utils/CSV.html]
looks to be a good source for what people actually do in practice; the comments in the description
suggest the author kept tinkering with the library to handle the various file formats "in
the wild." Of particular interest is the difference between what he calls the "Unix style"
and "Excel style" of CSV files. I don't think our code handles these differences. The bottom
of the page lists a number of other CSV resources.

Maybe we should just use the Ostermiller code rather than tinkering with the existing code?
Especially since, as part of the "memory fragmentation" project, I've already ripped out and
replaced the header parser and will need to replace the two implementations that write to
value vectors...

We've talked about comments. Another issue is blank lines. The [Ostermiller CSV lexer|http://ostermiller.org/utils/src/CSVLexer.lex.html]
says it ignores blank lines. Perhaps we should also, since a blank line can never be a valid
record (unless the header states that the file has only one field... You gotta love CSV...)

That same page shows how the lexer handles comments:


In this case, a comment character starts the comment (which, this being a lexer, could be
anywhere in the line.) If the resulting line is blank, it is ignored. Perhaps overkill...
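
As a rough illustration of that lexer-style behavior, here is a Python sketch. It is not the Ostermiller code; the names {{strip_comment}} and {{data_lines}} are hypothetical, and treating a comment character inside double quotes as data is my assumption, not something taken from the lexer source.

```python
def strip_comment(line, comment="#", quote='"'):
    """Return `line` cut off at the first comment character outside quotes."""
    in_quotes = False
    for i, ch in enumerate(line):
        if ch == quote:
            in_quotes = not in_quotes      # a doubled quote ("") toggles twice
        elif ch == comment and not in_quotes:
            return line[:i]                # the comment runs to end of line
    return line

def data_lines(lines, comment="#"):
    """Yield lines with comments removed, skipping any that end up blank."""
    for line in lines:
        stripped = strip_comment(line, comment)
        if stripped.strip():
            yield stripped
```

Note that under this rule an unquoted value such as {{;#}} from the issue's data set would be cut at the {{#}}, which is one reason the "anywhere in the line" behavior may be more than we want.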

Drill's CSV parser is based on the [uniVocity CSV parser|http://docs.univocity.com/parsers/2.4.1/com/univocity/parsers/csv/CsvParser.html],
now maintained on [GitHub|https://github.com/uniVocity/univocity-parsers]. A good question
is whether we can just use the jar available from the GitHub site rather than copying the
code as we have done.

Go down to the [Examples|https://github.com/uniVocity/univocity-parsers#reading-csv] section.
Voila! There is a CSV file in all its glory with blank lines and comments!

So, let us set our goal to read this [example file|http://github.com/uniVocity/univocity-parsers/tree/master/src/test/resources/examples/example.csv].
We need a set of options that allow us to do so. This means:

* File has headers (true/*false*)
* Read headers (true/*false*)
* Skip blank lines (*true*/false)
* Comment character (any single character, blank by default meaning no comment)
* Unix extensions (true/*false*)

If comments are enabled, a line may have leading whitespace before the comment character.
And, in a file with comments, a value can still begin with the comment character if it is quoted:

# A file with comment
col1, col2, col3
# Next line is a comment
# So is this next one
# Next line is not a comment (Excel style)
# Next line is not a comment (Unix style)
# Not a comment, leading space in value (i.e. "  #fred"), Unix style
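
The options listed above could be sketched in Python roughly as follows. This is not a proposed Drill API: the class {{TextReaderOptions}}, the function {{read_text}}, and all field names are hypothetical. I am assuming that "file has headers" means the first data line is a header, and that "read headers" means the header is returned rather than discarded. The Unix extensions option is carried as a flag but not interpreted here.

```python
import csv
from dataclasses import dataclass
from typing import Optional

@dataclass
class TextReaderOptions:
    has_headers: bool = False            # first data line is a header
    read_headers: bool = False           # return the header instead of discarding it
    skip_blank_lines: bool = True
    comment_char: Optional[str] = None   # None means comments are disabled
    unix_extensions: bool = False        # not interpreted in this sketch

def read_text(text, opts=None):
    """Return (header, rows) for comma-separated `text` under the given options."""
    opts = opts or TextReaderOptions()
    lines = []
    for line in text.splitlines():
        if opts.skip_blank_lines and not line.strip():
            continue
        # Leading whitespace is allowed before the comment character.
        if opts.comment_char and line.lstrip().startswith(opts.comment_char):
            continue
        lines.append(line)
    rows = list(csv.reader(lines))
    header = None
    if opts.has_headers and rows:
        header, rows = rows[0], rows[1:]
        if not opts.read_headers:
            header = None
    return header, rows
```

A quoted value such as {{"#fred"}} starts with the quote character rather than the comment character, so the line is kept and the value is read as data. With the default options, comments are disabled and a line starting with {{#}} is an ordinary record.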

The lesson in all this is that CSV files have a long history and exist in the field with many
variations. Mature parsers have learned to handle these variations. Because Drill must handle
data as it actually exists, not as we'd like it to be, Drill would be well served to learn
from those who came before us and to correctly support the variations encoded in these various
CSV libraries. Fortunately for us, if we just use the (newer? complete?) uniVocity libraries,
we get that for free.

> Drill text reader reports wrong results when column value starts with '#'
> -------------------------------------------------------------------------
>                 Key: DRILL-5239
>                 URL: https://issues.apache.org/jira/browse/DRILL-5239
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.10.0
>            Reporter: Rahul Challapalli
>            Assignee: Roman
>            Priority: Blocker
>              Labels: doc-impacting
> git.commit.id.abbrev=2af709f
> Data Set :
> {code}
> D|32
> 8h|234
> ;#|3489
> ^$*(|308
> #|98
> {code}
> Wrong Result : (Last row is missing)
> {code}
> select columns[0] as col1, columns[1] as col2 from dfs.`/drill/testdata/wtf2.tbl`;
> +-------+-------+
> | col1  | col2  |
> +-------+-------+
> | D     | 32    |
> | 8h    | 234   |
> | ;#    | 3489  |
> | ^$*(  | 308   |
> +-------+-------+
> 4 rows selected (0.233 seconds)
> {code}
> The issue does not however happen with a parquet file

This message was sent by Atlassian JIRA
