drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5490) RepeatedVarCharOutput.recordStart becomes corrupted on realloc
Date Tue, 09 May 2017 05:22:04 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002056#comment-16002056

Paul Rogers commented on DRILL-5490:

More bizarre. In actual use, {{recordStart}} is, in fact, 0 during the record. But, when extracting
headers, the header is read, then {{finishRecord()}} is called:

  public void finishRecord() {
    recordStart = characterData;

At this point, {{recordStart}} is set to "record end"...

So, this line actually works:

    if (this.recordStart != characterData) {
      throw new ExecutionSetupException("record text was requested before finishing record");

But, it works in the sense that {{recordStart}} and {{characterData}} both point to the *end*
of the record. If it were not for the issue described in DRILL-5491, the logic would be that
{{recordStart}} would be zero if we didn't finish the record, but we we always do, so the
check is a complete no-op.

> RepeatedVarCharOutput.recordStart becomes corrupted on realloc
> --------------------------------------------------------------
>                 Key: DRILL-5490
>                 URL: https://issues.apache.org/jira/browse/DRILL-5490
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Priority: Minor
> The class {{RepeatedVarCharOutput}}, used to read text data into the {{columns}} array,
has a member called {{recordStart}} which is supposed to track the memory offset of the start
of the record data in the {{columns}} VarChar array. However, the logic for the member is
very wrong. First, it is initialized based on an uninitialized variable:
> {{code}}
>   public void startBatch() {
>     this.recordStart = characterDataOriginal;
>     ...
>     loadVarCharDataAddress();
>   }
>   private void loadVarCharDataAddress(){
>     ...
>     this.characterDataOriginal = buf.memoryAddress();
>     ...
>   }
> {{code}}
> Second, the class keeps track of actual memory addresses for the VarChar data. When data
would exceed the buffer, memory is reallocated using the following method. But, this method
*does not* adjust the {{recordStart}} member, which now points to a bogus memory address:
> {code}
>   private void expandVarCharData(){
>     vector.getDataVector().reAlloc();
>     long diff = characterData - characterDataOriginal;
>     loadVarCharDataAddress();
>     characterData += diff;
>   }
> {code}
> Fortunately, like so many bugs in this class, these bugs are offset by the fact that
the variable is never used in a way that cause obvious problems (else these issues would have
been found sooner.) The only real use is:
> {code}
>   @Override
>   public boolean rowHasData() {
>     return recordStart < characterData;
>   }
> {code}
> Which, it turns out, is used only at EOF in {{TextReader.parseRecord()}}:
> {code}
>     } catch(StreamFinishedPseudoException e) {
>       // if we've written part of a field or all of a field, we should send this row.
>       if (fieldsWritten == 0 && !output.rowHasData()) {
>         throw e;
>       }
>     }
> {code}
> Thus, the bug will cause a failure only if:
> * A batch is resized, and
> * The new address is lower than the original address, and
> * It is the last batch in the file.
> Because of the low probability of hitting the bug, the priority is set to Minor.

This message was sent by Atlassian JIRA

View raw message