drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5351) Excessive bounds checking in the Parquet reader
Date Thu, 16 Mar 2017 21:07:41 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15928912#comment-15928912 ]

ASF GitHub Bot commented on DRILL-5351:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/781#discussion_r106491214
  
    --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
    @@ -502,11 +502,15 @@ public void setSafe(int index, byte[] bytes) {
           assert index >= 0;
     
           final int currentOffset = offsetVector.getAccessor().get(index);
    -      while (data.capacity() < currentOffset + bytes.length) {
    -        reAlloc();
    -      }
           offsetVector.getMutator().setSafe(index + 1, currentOffset + bytes.length);
    -      data.setBytes(currentOffset, bytes, 0, bytes.length);
    +      try {
    +        data.setBytes(currentOffset, bytes, 0, bytes.length);
    +      } catch (IndexOutOfBoundsException e) {
    --- End diff ---
    
    While this is a clever way to avoid checks, it will make debugging Drill harder. Intentionally throwing (and catching) a common exception on the normal write path makes it much harder to spot the cases where that same exception indicates a real error.
    
    Let's take a step back. One of the things we need to change in Parquet is to avoid "low-density batches": vectors that carry very little data. It turns out one cause is tied up with the code's assumption that it can tell when it has reached the end of a vector. (There are many bugs, but that is the key idea.)
    
    Vectors don't have that ability today, so the code never worked.
    
    What if we solve that problem, and yours, by changing how the DrillBuf works:
    
    ```
    if (!data.setBytesIfCapacity(...)) {
      reAlloc();
      data.setBytes(...);
    }
    ```
    
    The above avoids the spurious exception *and* provides the means to manage variable-length
vectors in Parquet.
    
    Note that the bounds check is still done, but only inside DrillBuf. And, of course, the PR code performs that same check: it is exactly the check that raises the exception.
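    
    As a rough sketch, the proposed DrillBuf method might look like the following (hypothetical code, not the actual DrillBuf API; capacity() and setBytes() are the existing ByteBuf operations):
    
    ```
    // Hypothetical sketch only: the capacity test replaces the thrown
    // exception, and the caller re-allocates on a false return.
    public boolean setBytesIfCapacity(int index, byte[] src, int srcIndex, int length) {
      if (index < 0 || length < 0 || index + length > capacity()) {
        return false;                          // full: caller calls reAlloc() and retries
      }
      setBytes(index, src, srcIndex, length);  // now known to be in bounds
      return true;
    }
    ```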


> Excessive bounds checking in the Parquet reader 
> ------------------------------------------------
>
>                 Key: DRILL-5351
>                 URL: https://issues.apache.org/jira/browse/DRILL-5351
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>
> In profiling the Parquet reader, variable-length decoding appears to be a major bottleneck,
> making the reader CPU bound rather than disk bound.
> A YourKit profile indicates that the following methods are severe bottlenecks:
>   VarLenBinaryReader.determineSizeSerial(long)
>   NullableVarBinaryVector$Mutator.setSafe(int, int, int, int, DrillBuf)
>   DrillBuf.chk(int, int)
>   NullableVarBinaryVector$Mutator.fillEmpties()
> The problem is that each of these methods does some form of bounds checking, and, of
> course, the actual write to the ByteBuf is bounds checked as well.
> DrillBuf.chk can be disabled by a configuration setting. Disabling it does improve
> performance of TPCH queries. In addition, all regression, unit, and TPCH-SF100 tests pass.

> I would recommend we allow users to turn this check off for performance-critical
> queries.
> Removing the bounds checking at every level is going to be a fair amount of work. In
> the meantime, it appears that a few simple changes to variable-length vectors improve
> query performance by about 10% across the board.
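
To make the quoted description concrete, here is a simplified, self-contained sketch (hypothetical code, not Drill's actual classes) of how bounds checks stack up on the variable-length write path:

```
// Hypothetical illustration of redundant bounds checks on a
// variable-length write path; the names only mirror Drill's roles.
class IllustrativeVector {
  private byte[] data = new byte[256];

  void setSafe(int offset, byte[] bytes) {
    // Layer 1: setSafe() checks capacity and re-allocates as needed.
    while (data.length < offset + bytes.length) {
      data = java.util.Arrays.copyOf(data, data.length * 2);
    }
    setBytes(offset, bytes);
  }

  private void setBytes(int offset, byte[] bytes) {
    // Layer 2: an explicit range check, analogous to DrillBuf.chk().
    if (offset < 0 || offset + bytes.length > data.length) {
      throw new IndexOutOfBoundsException();
    }
    // Layer 3: System.arraycopy() performs its own bounds check too.
    System.arraycopy(bytes, 0, data, offset, bytes.length);
  }
}
```

Each call to setSafe() pays for all three checks even though one would suffice, which is the overhead the issue proposes to reduce.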



