drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5487) Vector corruption in CSV with headers and truncated last row
Date Mon, 08 May 2017 21:11:04 GMT
Paul Rogers created DRILL-5487:

             Summary: Vector corruption in CSV with headers and truncated last row
                 Key: DRILL-5487
                 URL: https://issues.apache.org/jira/browse/DRILL-5487
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.10.0
            Reporter: Paul Rogers

The CSV format plugin allows two ways of reading data:

* As named columns
* As a single array, called {{columns}}, that holds all columns for a row

The named columns feature corrupts the offset vectors if the last row of the file is truncated,
that is, if it leaves off one or more columns.

To illustrate the CSV data corruption, I created a CSV file, test4.csv, of the following form:


Note that the file is truncated: the comma and second field are missing on the last line.

Then, I created a simple test using the "cluster fixture" framework:

  public void readerTest() throws Exception {
    FixtureBuilder builder = ClusterFixture.builder();

    try (ClusterFixture cluster = builder.build();
         ClientFixture client = cluster.clientFixture()) {
      TextFormatConfig csvFormat = new TextFormatConfig();
      csvFormat.fieldDelimiter = ',';
      csvFormat.skipFirstLine = false;
      csvFormat.extractHeader = true;
      cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
      String sql = "SELECT * FROM `dfs.data`.`csv/test4.csv` LIMIT 10";

The results show we've got a problem:

Exception (no rows returned): org.apache.drill.common.exceptions.UserRemoteException: SYSTEM
IllegalArgumentException: length: -3 (expected: >= 0)

If the last line were:


Then the offset vector should look like this:

[0, 3, 3]

Very likely we have an offset vector that looks like this instead:

[0, 3, 0]

When we compute the length of the second column of the second row, we should get:

length = offset[2] - offset[1] = 3 - 3 = 0

Instead we get:

length = offset[2] - offset[1] = 0 - 3 = -3
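
The arithmetic above can be reproduced outside Drill with a small sketch. This is an illustration only, not Drill's vector code: the class and method names are hypothetical, and a plain int array stands in for the offset vector.

```java
// Hypothetical stand-in for an offset vector: entry i spans
// offsets[i] (inclusive) to offsets[i + 1] (exclusive).
class OffsetLengthDemo {

  // Length of entry `index`: end offset minus start offset.
  static int length(int[] offsets, int index) {
    return offsets[index + 1] - offsets[index];
  }

  public static void main(String[] args) {
    int[] expected = {0, 3, 3};  // second entry is empty, offset carried forward
    int[] suspected = {0, 3, 0}; // last offset never written, still 0

    System.out.println(length(expected, 1));  // prints 0
    System.out.println(length(suspected, 1)); // prints -3
  }
}
```

The negative value matches the "length: -3" reported in the exception, which is why the [0, 3, 0] layout seems the likely culprit.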

The summary is that a premature EOF appears to cause the "missing" columns to be skipped;
they are not filled with a blank value to "bump" the offset vectors to fill in the last row.
Instead, they are left at 0, causing havoc downstream in the query.
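
One possible shape for a fix is to back-fill the skipped columns at the end of each row so that every offset is carried forward. The sketch below is a toy model under stated assumptions, not Drill's reader or vector API: ToyVarCharColumn, write, fillEmpties and length are hypothetical names, and the data is assumed to be ASCII so that character count equals byte count.

```java
// Toy model of a variable-width column (hypothetical, not Drill's classes).
class ToyVarCharColumn {
  private final StringBuilder data = new StringBuilder();
  private final int[] offsets;  // zero-initialized, like a freshly allocated buffer
  private int lastWritten = 0;  // number of rows whose end offset has been set

  ToyVarCharColumn(int capacity) {
    offsets = new int[capacity + 1];
  }

  // Write a value for `row`, back-filling any rows skipped before it.
  void write(int row, String value) {
    fillEmpties(row);
    data.append(value);
    offsets[row + 1] = data.length();
    lastWritten = row + 1;
  }

  // Give every unwritten row below `rowCount` an empty value by
  // carrying the previous offset forward.
  void fillEmpties(int rowCount) {
    for (int i = lastWritten; i < rowCount; i++) {
      offsets[i + 1] = offsets[i];
    }
    lastWritten = Math.max(lastWritten, rowCount);
  }

  int length(int row) {
    return offsets[row + 1] - offsets[row];
  }

  public static void main(String[] args) {
    ToyVarCharColumn col = new ToyVarCharColumn(2);
    col.write(0, "abc");               // row 0 has a value; row 1 is truncated away
    System.out.println(col.length(1)); // prints -3: offsets are [0, 3, 0]
    col.fillEmpties(2);                // bump the offset for the missing row
    System.out.println(col.length(1)); // prints 0: offsets are [0, 3, 3]
  }
}
```

In this model the reader would call fillEmpties with the row count before handing the batch downstream, which is the "bump" described above.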

