drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5266) Parquet Reader produces "low density" record batches
Date Wed, 15 Feb 2017 23:58:41 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868835#comment-15868835 ]

Paul Rogers commented on DRILL-5266:
------------------------------------

More silly code:

{code}
public abstract class ColumnReader<V extends ValueVector> {
  ...
  // length of single data value in bits, if the length is fixed
  int dataTypeLengthInBits;
  ...
  protected ColumnReader(ParquetRecordReader parentReader, int allocateSize, ColumnDescriptor descriptor, ...
    ...
      if (columnDescriptor.getType() == PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY) {
        dataTypeLengthInBits = columnDescriptor.getTypeLength() * 8;
      } else {
        dataTypeLengthInBits = ParquetRecordReader.getTypeLengthInBits(columnDescriptor.getType());
      }
  ...
  protected boolean checkVectorCapacityReached() {
    if (bytesReadInCurrentPass + dataTypeLengthInBits > capacity()) {
{code}

Note that the code adds a variable holding bytes (bytesReadInCurrentPass) to one whose name says bits (dataTypeLengthInBits), and compares the sum to a capacity in bytes. But that might be OK, because despite its name the variable sometimes holds bytes (see one of the lines above). But at other times it holds bits (see the other line above).

So, we have a variable that holds bits some of the time, bytes at other times, and is compared to bytes all the time...
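
For what it's worth, here is a standalone sketch (not Drill's actual code; the class name and the constant values are made up) of what the check looks like once the units are made consistent, converting bits to bytes at the point of use:

{code}
// Standalone sketch, not Drill code: the capacity check with consistent units.
public class CapacityCheckSketch {
  // Hypothetical stand-ins for the reader's state.
  static final int vectorCapacityBytes = 131072; // a 128 KiB vector
  static int bytesReadInCurrentPass = 131070;    // bytes consumed so far
  static int dataTypeLengthInBits = 32;          // a 4-byte INT, stored as bits

  static boolean checkVectorCapacityReached() {
    // Convert bits to bytes before comparing, so all three terms share a unit.
    int dataTypeLengthInBytes = dataTypeLengthInBits / 8;
    return bytesReadInCurrentPass + dataTypeLengthInBytes > vectorCapacityBytes;
  }

  public static void main(String[] args) {
    // With 2 bytes left and a 4-byte value pending, the vector is full.
    System.out.println(checkVectorCapacityReached()); // prints: true
  }
}
{code}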

> Parquet Reader produces "low density" record batches
> ----------------------------------------------------
>
>                 Key: DRILL-5266
>                 URL: https://issues.apache.org/jira/browse/DRILL-5266
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.10
>            Reporter: Paul Rogers
>
> Testing with the managed sort revealed that, for at least one file, Parquet produces "low-density" batches: batches in which only 5% of each value vector contains actual data, with the rest being unused space. When fed into the sort, we end up buffering 95% wasted space, using only 5% of available memory to hold actual query data. The result is poor performance of the sort, as it must spill far more frequently than expected.
> The managed sort analyzes incoming batches to prepare good memory use estimates. The following is the output from the Parquet file in question:
> {code}
> Actual batch schema & sizes {
>   T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> ...
>   c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
>   Records: 1129, Total size: 32006144, Row width:28350, Density:5}
> {code}
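
As a side note on reading these numbers: density appears to be roughly 100 * data size / vector size (the exact rounding is a guess), which matches the two figures quoted above:

{code}
// Standalone arithmetic check of the density figures above, not Drill code.
// Assumes density ~= 100 * dataSize / vectorSize, rounded up.
public class DensityCheck {
  public static void main(String[] args) {
    System.out.println(Math.ceil(100.0 * 4516 / 131072)); // 4.0  -> cs_sold_date_sk
    System.out.println(Math.ceil(100.0 * 30327 / 49152)); // 62.0 -> c_email_address
  }
}
{code}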



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
