drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5266) Parquet Reader produces "low density" record batches
Date Wed, 15 Feb 2017 17:35:41 GMT
Paul Rogers created DRILL-5266:
----------------------------------

             Summary: Parquet Reader produces "low density" record batches
                 Key: DRILL-5266
                 URL: https://issues.apache.org/jira/browse/DRILL-5266
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 1.10
            Reporter: Paul Rogers


Testing with the managed sort revealed that, for at least one file, Parquet produces "low-density"
batches: batches in which only 5% of each value vector contains actual data, with the rest
being unused space. When such batches are fed into the sort, 95% of what it buffers is wasted
space, and only 5% of available memory holds actual query data. The result is poor sort
performance, because the sort must spill far more frequently than expected.
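
To make the cost concrete, here is a minimal sketch of the arithmetic, assuming the sort's memory budget is filled entirely with buffered batches of uniform density. The function names and the 1 GiB budget are hypothetical and are not taken from Drill.

```python
def useful_bytes(buffered_bytes: int, density_pct: int) -> int:
    """Bytes of actual data held when buffers are only density_pct percent full."""
    return buffered_bytes * density_pct // 100


def spill_factor(density_pct: int) -> float:
    """How many times more often a memory-bound sort fills its budget,
    compared with fully dense (100%) batches."""
    return 100 / density_pct


budget = 1 << 30  # hypothetical 1 GiB sort memory budget
print(useful_bytes(budget, 5))  # bytes of real data held at 5% density
print(spill_factor(5))          # relative spill frequency at 5% density
```

Under these assumptions, a 5% density means the sort fills its budget about 20 times as often as it would with fully dense batches.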

The managed sort analyzes incoming batches to prepare good memory use estimates. The following
is the output for the Parquet file in question:

{code}
Actual batch schema & sizes {
  T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
  T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
  T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
...
  c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
  Records: 1129, Total size: 32006144, Row width:28350, Density:5}
{code}
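
The figures in this output can be cross-checked with a short sketch. The formulas below are inferred from the numbers shown (density as data size over vector size, and row width as total size over record count, both rounded up); they are an assumption that happens to match this output, not a confirmed description of how Drill computes them. The helper names are hypothetical.

```python
def ceil_div(numerator: int, denominator: int) -> int:
    """Integer division rounded up."""
    return (numerator + denominator - 1) // denominator


def density_pct(data_size: int, vector_size: int) -> int:
    """Percent of the vector occupied by data, rounded up (inferred formula)."""
    return ceil_div(100 * data_size, vector_size)


# Fixed-width 4-byte column: 1129 rows * 4 bytes = 4516 bytes of data
# held in a vector sized for 32768 rows (131072 bytes).
assert 1129 * 4 == 4516
assert 32768 * 4 == 131072
print(density_pct(4516, 131072))   # cs_sold_date_sk

# Variable-width column from the same batch.
print(density_pct(30327, 49152))   # c_email_address

# Row width as total batch size spread over the record count.
print(ceil_div(32006144, 1129))
```

Read this way, the 4-byte columns are low-density because each vector is allocated for 32768 rows while the batch carries only 1129.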



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
