drill-dev mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5914) CSV (text) reader fails to parse quoted newlines in trailing fields
Date Mon, 30 Oct 2017 23:54:00 GMT
Paul Rogers created DRILL-5914:
----------------------------------

             Summary: CSV (text) reader fails to parse quoted newlines in trailing fields
                 Key: DRILL-5914
                 URL: https://issues.apache.org/jira/browse/DRILL-5914
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.11.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers


Consider the existing `TestCsvHeader.testCountOnCsvWithHeader()` unit test. The input file
is as follows:

```
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
```

Note the newline inside the description in the last record.

If we do a `SELECT *` query, the file is parsed fine; we get 4 records.

If we do a `SELECT Year, Model` query, the CSV reader uses a special trick: it short-circuits
reads on the three columns that are not wanted:

```
TextReader.parseRecord() {
...
        if (earlyTerm) {
          if (ch != newLine) {
            input.skipLines(1); // <-- skip lines
          }
          break;
        }
```

The `skipLines()` method skips forward in the file, discarding characters until it hits a newline:

```
      do {
        nextChar();
      } while (lineCount < expectedLineCount);
```

Note that this code handles individual characters; it is not aware of per-field semantics.
That is, unlike the higher-level parser methods, the `nextChar()` method does not treat
newlines inside quoted fields as special, so a quoted newline ends the skip too early.
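
To make the difference concrete, here is a minimal standalone sketch, not Drill's actual code. The class `QuoteAwareSkip` and both methods are hypothetical names invented for illustration, and the sketch assumes `"` is the quote character, `\n` is the record terminator, and the skip starts outside any quoted field. It contrasts a character-level skip with one that tracks quote state:

```java
// Hypothetical illustration only; none of these names exist in Drill.
class QuoteAwareSkip {

  // Character-level skip: stops at the first newline, quoted or not.
  static int skipLineNaive(String text, int pos) {
    while (pos < text.length()) {
      if (text.charAt(pos++) == '\n') {
        break;
      }
    }
    return pos;
  }

  // Quote-aware skip: a newline ends the record only outside quotes.
  // An escaped quote ("") toggles the flag twice, leaving it unchanged.
  static int skipRecord(String text, int pos) {
    boolean inQuotes = false;
    while (pos < text.length()) {
      char ch = text.charAt(pos++);
      if (ch == '"') {
        inQuotes = !inQuotes;
      } else if (ch == '\n' && !inQuotes) {
        break;
      }
    }
    return pos;
  }

  public static void main(String[] args) {
    String text = "1996,Jeep,Grand Cherokee,\"MUST SELL!\n"
        + "air, moon roof, loaded\",4799.00\nNEXT";
    // Position just after the Model field, where early termination would occur.
    int pos = text.indexOf('"');
    System.out.println("naive resumes at: " + text.substring(skipLineNaive(text, pos)));
    System.out.println("quote-aware resumes at: " + text.substring(skipRecord(text, pos)));
  }
}
```

In this sketch the naive skip resumes in the middle of the Description field, while the quote-aware skip resumes at the start of the next record. A real fix would also need to honor the reader's configured quote, escape, and line-delimiter settings.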

This problem shows up acutely in a `SELECT COUNT(*)` style query that skips all fields; as a
result, the input is counted as five records, not four.
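
The miscount can be reproduced outside Drill with a small sketch. The class `QuotedNewlineCount` and its methods are hypothetical, written only to mirror the behavior described here; the data is the test file from this report with the header line omitted, and `"` is assumed to be the quote character:

```java
// Hypothetical illustration only; this is not Drill's reader.
class QuotedNewlineCount {

  // The sample rows from this report, header excluded.
  static final String DATA = String.join("\n",
      "1997,Ford,E350,\"ac, abs, moon\",3000.00",
      "1999,Chevy,\"Venture \"\"Extended Edition\"\"\",\"\",4900.00",
      "1999,Chevy,\"Venture \"\"Extended Edition, Very Large\"\"\",,5000.00",
      "1996,Jeep,Grand Cherokee,\"MUST SELL!",
      "air, moon roof, loaded\",4799.00") + "\n";

  // Counts every newline as a record terminator (the buggy behavior).
  static int countNaive(String text) {
    int count = 0;
    for (int i = 0; i < text.length(); i++) {
      if (text.charAt(i) == '\n') {
        count++;
      }
    }
    return count;
  }

  // Counts only newlines that fall outside quoted fields.
  static int countQuoteAware(String text) {
    int count = 0;
    boolean inQuotes = false;
    for (int i = 0; i < text.length(); i++) {
      char ch = text.charAt(i);
      if (ch == '"') {
        inQuotes = !inQuotes;
      } else if (ch == '\n' && !inQuotes) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println("naive count: " + countNaive(DATA));
    System.out.println("quote-aware count: " + countQuoteAware(DATA));
  }
}
```

The naive count treats the newline after `MUST SELL!` as the end of a record, which accounts for the extra row.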



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
