drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-5266) Parquet Reader produces "low density" record batches
Date Wed, 15 Feb 2017 21:56:41 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868627#comment-15868627 ]

Paul Rogers edited comment on DRILL-5266 at 2/15/17 9:56 PM:
-------------------------------------------------------------

The logic for determining field widths is confusing.

{code}
  public int next() {
    ...
      if (allFieldsFixedLength) {
        ...
      } else { // variable length columns
        long fixedRecordsToRead = varLengthReader.readFields(recordsToRead, firstColumnStatus); // Read var
        readAllFixedFields(fixedRecordsToRead); // Read fixed
      }
{code}

The above claims that we call one method to read the variable-length fields, then another to read
the fixed-length fields. Fine: presumably we pack in the variable-length fields, figure out how
many records that is, then read the fixed-length data to match. Makes sense. But then:

{code}
public class VarLenBinaryReader {
  public long readFields(long recordsToReadInThisPass, ColumnReader<?> firstColumnStatus)
      throws IOException {
    ...
    recordsReadInCurrentPass = determineSizesSerial(recordsToReadInThisPass);
    ...
  }

  private long determineSizesSerial(long recordsToReadInThisPass) throws IOException {
    ...
      // check that the next record will fit in the batch
      if (exitLengthDeterminingLoop ||
          (recordsReadInCurrentPass + 1) * parentReader.getBitWidthAllFixedFields()
              + totalVariableLengthData + lengthVarFieldsInCurrentRecord > parentReader.getBatchSize()) {
{code}

That is, the *variable* length reader makes its decision about when to stop based, in part,
on the *fixed* length fields. This contradicts the earlier code and renders the overall
operation incoherent.

Given that the variable-length width is not returned (see above), the calculation reduces
to dividing the batch size by the fixed-length record width. This can all be refactored to be
simpler.
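
For illustration only, here is a minimal sketch of the simplification this implies. The method name and parameters are hypothetical (not existing Drill APIs), and it assumes the fixed-width value is expressed in bytes per record:

{code}
// Hypothetical sketch, not Drill code: if the variable-length data never feeds
// back into the stopping decision (as argued above), the per-batch record limit
// reduces to the batch size divided by the fixed-width portion of each record.
private long fixedOnlyRecordLimit(long batchSizeBytes, long fixedBytesPerRecord) {
  if (fixedBytesPerRecord <= 0) {
    return Long.MAX_VALUE; // no fixed-width columns; some other cap must apply
  }
  return batchSizeBytes / fixedBytesPerRecord;
}
{code}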





> Parquet Reader produces "low density" record batches
> ----------------------------------------------------
>
>                 Key: DRILL-5266
>                 URL: https://issues.apache.org/jira/browse/DRILL-5266
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.10
>            Reporter: Paul Rogers
>
> Testing with the managed sort revealed that, for at least one file, Parquet produces
> "low-density" batches: batches in which only 5% of each value vector contains actual data,
> with the rest being unused space. When fed into the sort, we end up buffering 95% wasted
> space, using only 5% of available memory to hold actual query data. The result is poor
> performance of the sort, as it must spill far more frequently than expected.
> The managed sort analyzes incoming batches to prepare good memory use estimates. The
> following is the output from the Parquet file in question:
> {code}
> Actual batch schema & sizes {
>   T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> ...
>   c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
>   Records: 1129, Total size: 32006144, Row width: 28350, Density: 5}
> {code}
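
To make the waste concrete, a rough reading of the summary line above, assuming "Density: 5" means that roughly 5% of the allocated batch memory holds real data:

{code}
// Rough arithmetic from the summary line (assumption: density is the percentage
// of allocated memory that actually holds data):
//   Allocated:   32,006,144 bytes for 1,129 records
//   Row width:   32,006,144 / 1,129 ≈ 28,350 bytes of allocation per record
//   Useful data: ~5% of 32,006,144 ≈ 1.6 MB; the remaining ~30 MB is unused space
{code}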



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
