drill-dev mailing list archives

From "Parth Chandra (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5351) Excessive bounds checking in the Parquet reader
Date Mon, 13 Mar 2017 17:15:41 GMT
Parth Chandra created DRILL-5351:
------------------------------------

             Summary: Excessive bounds checking in the Parquet reader 
                 Key: DRILL-5351
                 URL: https://issues.apache.org/jira/browse/DRILL-5351
             Project: Apache Drill
          Issue Type: Improvement
            Reporter: Parth Chandra


In profiling the Parquet reader, the variable-length decoding appears to be a major bottleneck,
making the reader CPU bound rather than disk bound.
A YourKit profile indicates that the following methods are severe bottlenecks:

  VarLenBinaryReader.determineSizeSerial(long)
  NullableVarBinaryVector$Mutator.setSafe(int, int, int, int, DrillBuf)
  DrillBuf.chk(int, int)
  NullableVarBinaryVector$Mutator.fillEmpties()

The problem is that each of these methods performs some form of bounds checking, and, of
course, the actual write to the ByteBuf is bounds checked as well.
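The layering can be sketched as follows. The class and method names below echo the profile but are simplified stand-ins, not Drill's actual implementation: the same logical write is range-checked in the mutator, again in the buffer wrapper (analogous to DrillBuf.chk), and a third time by the underlying ByteBuffer.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only: three layers of bounds checks on one write path.
class SketchBuf {
    private final ByteBuffer delegate = ByteBuffer.allocate(1024);

    // Layer 2: explicit range check, analogous to DrillBuf.chk(int, int)
    void chk(int index, int length) {
        if (index < 0 || index + length > delegate.capacity()) {
            throw new IndexOutOfBoundsException(index + "+" + length);
        }
    }

    void setBytes(int index, byte[] src) {
        chk(index, src.length);               // layer 2
        for (int i = 0; i < src.length; i++) {
            delegate.put(index + i, src[i]);  // layer 3: ByteBuffer checks again
        }
    }
}

class SketchMutator {
    private final SketchBuf buf = new SketchBuf();

    // Layer 1: setSafe re-validates before delegating to the buffer
    void setSafe(int index, byte[] value) {
        if (index < 0) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        buf.setBytes(index * 8, value);       // layers 2 and 3 run inside
    }
}
```

On a hot per-value path, each of these branches is cheap individually, but together they add up, which is consistent with the profile above.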

DrillBuf.chk can be disabled by a configuration setting. Disabling it does improve the performance
of TPCH queries, and all regression, unit, and TPCH-SF100 tests still pass.
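A minimal sketch of how such a configuration-controlled check might look (the property name below is an assumption for illustration; the issue itself only states that a configuration setting exists). Reading the flag once into a static final field lets the JIT treat it as a constant and drop the branch entirely when checking is disabled.

```java
// Sketch of a flag-guarded bounds check. The system property name
// "drill.enable_unsafe_memory_access" is assumed for illustration.
class BoundsChecking {
    // Checking is on by default; the property opts out.
    static final boolean BOUNDS_CHECKING_ENABLED =
        !Boolean.getBoolean("drill.enable_unsafe_memory_access");

    static void check(int index, int length, int capacity) {
        if (BOUNDS_CHECKING_ENABLED
                && (index < 0 || index + length > capacity)) {
            throw new IndexOutOfBoundsException(
                index + "+" + length + " > " + capacity);
        }
    }
}
```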

I would recommend we allow users to turn this check off for performance-critical queries.

Removing the bounds checking at every level is going to be a fair amount of work. In the meantime,
a few simple changes to the variable-length vectors appear to improve query performance by about
10% across the board.
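One example of the kind of simple change that can recover much of this cost without removing checks wholesale (a hedged sketch; the names and structure are illustrative, not the actual patch): validate capacity once per batch, then copy each value without a per-value check.

```java
import java.util.List;

// Illustrative sketch: hoist the bounds check out of the per-value path.
class BatchWriter {
    private byte[] data = new byte[16];
    private int writeIndex = 0;

    void writeBatch(List<byte[]> values) {
        int needed = 0;
        for (byte[] v : values) {
            needed += v.length;
        }
        ensureCapacity(writeIndex + needed);   // single check per batch
        for (byte[] v : values) {
            // No per-value bounds check; capacity is already guaranteed.
            System.arraycopy(v, 0, data, writeIndex, v.length);
            writeIndex += v.length;
        }
    }

    private void ensureCapacity(int required) {
        while (data.length < required) {
            byte[] bigger = new byte[data.length * 2];
            System.arraycopy(data, 0, bigger, 0, writeIndex);
            data = bigger;
        }
    }

    int size() { return writeIndex; }
}
```

The trade-off is that the writer, not the buffer, becomes responsible for sizing correctly, which is why this is best confined to a few well-tested hot paths.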

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
