drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Khurram Faraaz (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows
Date Thu, 06 Jul 2017 07:11:00 GMT

     [ https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Khurram Faraaz updated DRILL-5486:
----------------------------------
    Component/s: Storage - Text & CSV

> Compliant text reader tries, but fails, to ignore empty rows
> ------------------------------------------------------------
>
>                 Key: DRILL-5486
>                 URL: https://issues.apache.org/jira/browse/DRILL-5486
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> The "compliant" text reader (for CSV files, etc.) has two modes: reading with headers
(which creates a scalar vector per file column), or without headers (that creates a single
{{columns}} vector with an array of all columns.
> When run in array mode, the code uses a class called {{RepeatedVarCharOutput}}. This
class attempts to ignore empty records:
> {code}
>   public void finishRecord() {
>     ...
>     // if there were no defined fields, skip.
>     if (fieldIndex > -1) {
>       batchIndex++;
>       recordCount++;
>     }
>     ...
> {code}
> As it turns out, this works only on the *first* row. On the first row, the {{fieldIndex}}
has its initial value of -1. But, for subsequent records, {{fieldIndex}} is never reset and
retains its value from the last field set.
> The result is that the code skips the first empty row, but not empty rows elsewhere in
the file.
> Further, on the first row, the logic is flawed. The part shown above is only for the
row counts. Here is more of the logic:
> {code}
>   @Override
>   public void finishRecord() {
>     recordStart = characterData;
>     ...
>     int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4;
>     PlatformDependent.putInt(repeatedOffset, newOffset);
>     repeatedOffset += 4;
>     // if there were no defined fields, skip.
>     if (fieldIndex > -1) {
>       batchIndex++;
>       recordCount++;
>     }
>   }
> {code}
> Note that the underlying vector *is* bumped for the record, even though the row count
is not. Later, this will cause the container to have one less record.
> This would be a problem if the row count ({{batchIndex}}) was used for the batch row
count. But, it is not, the next level of code keeps a separate count, so the "skip empty row"
logic causes the counts to get out of sync between the {{RepeatedVarCharOutput}} and the next
level of reader.
> In sense, there are multiple self-cancelling bugs:
> * The skip-empty-row logic is not synchronized with the upper layer.
> * That bug is partly masked because the logic is not triggered except on the first row.
> * That bug is masked because we always set the row offset, even if we ignore the row
in the row count.
> * That bug is masked because the incorrect row count is ignored when setting the container
row count.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message