drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows
Date Mon, 08 May 2017 23:24:04 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001777#comment-16001777

Paul Rogers commented on DRILL-5486:

Another complications. Although {{BaseRepeatedValueVector}} sets the {{`columns`}} vector
column count (incorrectly, as noted above), the caller, {{ScanBatch}}, keeps its own track
of record count, and uses code to set that record count on each vector. So, even if {{setValueCount()}}
per vector did anything (which it does not), the wrong value set by {{BaseRepeatedValueVector}}
would be overwritten by {{ScanBatch}}.

> Compliant text reader tries, but fails, to ignore empty rows
> ------------------------------------------------------------
>                 Key: DRILL-5486
>                 URL: https://issues.apache.org/jira/browse/DRILL-5486
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Priority: Minor
> The "compliant" text reader (for CSV files, etc.) has two modes: reading with headers
(which creates a scalar vector per file column), or without headers (that creates a single
{{columns}} vector with an array of all columns.
> When run in array mode, the code uses a class called {{RepeatedVarCharOutput}}. This
class attempts to ignore empty records:
> {code}
>   public void finishRecord() {
>     ...
>     // if there were no defined fields, skip.
>     if (fieldIndex > -1) {
>       batchIndex++;
>       recordCount++;
>     }
>     ...
> {code}
> As it turns out, this works only on the *first* row. On the first row, the {{fieldIndex}}
has its initial value of -1. But, for subsequent records, {{fieldIndex}} is never reset and
retains its value from the last field set.
> The result is that the code skips the first empty row, but not empty rows elsewhere in
the file.
> Further, on the first row, the logic is flawed. The part shown above is only for the
row counts. Here is more of the logic:
> {code}
>   @Override
>   public void finishRecord() {
>     recordStart = characterData;
>     ...
>     int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4;
>     PlatformDependent.putInt(repeatedOffset, newOffset);
>     repeatedOffset += 4;
>     // if there were no defined fields, skip.
>     if (fieldIndex > -1) {
>       batchIndex++;
>       recordCount++;
>     }
>   }
> {code}
> Note that the underlying vector *is* bumped for the record, even though the row count
is not. Later, this will cause the container to have one less record.
> This would be a problem if the row count ({{batchIndex}}) was used for the batch row
count. But, it is not, the next level of code keeps a separate count, so the "skip empty row"
logic causes the counts to get out of sync between the {{RepeatedVarCharOutput}} and the next
level of reader.
> In sense, there are multiple self-cancelling bugs:
> * The skip-empty-row logic is not synchronized with the upper layer.
> * That bug is partly masked because the logic is not triggered except on the first row.
> * That bug is masked because we always set the row offset, even if we ignore the row
in the row count.
> * That bug is masked because the incorrect row count is ignored when setting the container
row count.

This message was sent by Atlassian JIRA

View raw message