drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-5266) Parquet Reader produces "low density" record batches
Date Wed, 15 Feb 2017 23:44:41 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868657#comment-15868657 ]

Paul Rogers edited comment on DRILL-5266 at 2/15/17 11:43 PM:
--------------------------------------------------------------

The "bits vs. bytes" fix raises the density of most vectors to 28% (26% for the batch overall):

{code}
Actual batch schema & sizes {
  T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 36156, row capacity: 32768, density: 28)
  ...
  T1¦¦c_customer_id(std col. size: 54, actual col. size: 16, total size: 2392064, vector size: 2359296, data size: 144624, row capacity: 32768, density: 7)
  ...
  T1¦¦c_salutation(std col. size: 54, actual col. size: 4, total size: 2392064, vector size: 2359296, data size: 28544, row capacity: 32768, density: 2)
  T1¦¦c_first_name(std col. size: 54, actual col. size: 6, total size: 2392064, vector size: 2359296, data size: 50916, row capacity: 32768, density: 3)
  T1¦¦c_last_name(std col. size: 54, actual col. size: 6, total size: 2392064, vector size: 2359296, data size: 53546, row capacity: 32768, density: 3)
  T1¦¦c_preferred_cust_flag(std col. size: 54, actual col. size: 1, total size: 2392064, vector size: 2359296, data size: 8780, row capacity: 32768, density: 1)
  ...
  T1¦¦c_login(std col. size: 54, actual col. size: 0, total size: 2392064, vector size: 2359296, data size: 0, row capacity: 32768, density: 0)
  T1¦¦c_email_address(std col. size: 54, actual col. size: 27, total size: 2392064, vector size: 2359296, data size: 238285, row capacity: 32768, density: 11)
  ...
  c_email_address(std col. size: 54, actual col. size: 27, total size: 344064, vector size: 327680, data size: 238285, row capacity: 16383, density: 73)
  Records: 9039, Total size: 32325632, Row width:3577, Density:26}

Input Batch Estimates: record size = 335 bytes; input batch = 32325632 bytes, 9039 records
{code}

The densest vector is now 73% full, but most are at 28%. The record count per batch has grown
from the original 1129 to 9039.
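
As a cross-check on the log above, the per-column density figures can be reproduced from the other fields. The sketch below assumes that density is the data size as a percentage of the vector size, rounded up to a whole percent. That formula is inferred from the numbers in the log, not taken from the Drill source, and the function name is made up for illustration.

```python
# (column, vector size in bytes, data size in bytes, density reported in the log)
COLUMNS = [
    ("T1.cs_sold_date_sk", 131072, 36156, 28),
    ("T1.c_customer_id", 2359296, 144624, 7),
    ("T1.c_salutation", 2359296, 28544, 2),
    ("T1.c_first_name", 2359296, 50916, 3),
    ("T1.c_last_name", 2359296, 53546, 3),
    ("T1.c_preferred_cust_flag", 2359296, 8780, 1),
    ("T1.c_login", 2359296, 0, 0),
    ("T1.c_email_address", 2359296, 238285, 11),
    ("c_email_address", 327680, 238285, 73),
]


def density_percent(data_size: int, vector_size: int) -> int:
    """Assumed formula: data bytes as a percent of vector bytes, rounded up."""
    return (100 * data_size + vector_size - 1) // vector_size


computed = {name: density_percent(data, vec) for name, vec, data, _ in COLUMNS}
reported = {name: density for name, _, _, density in COLUMNS}

for name in computed:
    print(f"{name}: computed {computed[name]}%, reported {reported[name]}%")
```

Under that assumption every listed column matches the logged value. The batch-level "Density:26" cannot be checked the same way, because the log elides most columns.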

Performance has improved from 37 secs to 27 secs:
{code}
Results: 1,434,519 records, 30 batches, 26,612 ms
{code}



> Parquet Reader produces "low density" record batches
> ----------------------------------------------------
>
>                 Key: DRILL-5266
>                 URL: https://issues.apache.org/jira/browse/DRILL-5266
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.10
>            Reporter: Paul Rogers
>
> Testing with the managed sort revealed that, for at least one file, the Parquet reader produces "low-density" batches: batches in which only 5% of each value vector contains actual data, with the rest being unused space. When fed into the sort, these batches force it to buffer 95% wasted space, using only 5% of available memory to hold actual query data. The result is poor sort performance, because the sort must spill far more frequently than expected.
> The managed sort analyzes incoming batches to prepare good memory use estimates. The following is the output for the Parquet file in question:
> {code}
> Actual batch schema & sizes {
>   T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> ...
>   c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
>   Records: 1129, Total size: 32006144, Row width:28350, Density:5}
> {code}
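
To illustrate why low density makes the sort spill sooner, here is a rough sketch comparing how many records fit in memory before and after the fix. It uses the batch sizes and record counts from the two logs in this message. The 1 GiB budget and the "buffer whole batches until the budget is full" model are assumptions made for the example, not a description of how the managed sort actually accounts for memory.

```python
def records_before_spill(budget_bytes: int, batch_bytes: int, records_per_batch: int):
    """Return (whole batches that fit in the budget, records those batches carry)."""
    batches = budget_bytes // batch_bytes
    return batches, batches * records_per_batch


BUDGET = 1 << 30  # hypothetical 1 GiB sort budget; not a figure from the issue

# "Records" and "Total size" from the original log (density 5)
before = records_before_spill(BUDGET, 32006144, 1129)
# "Records" and "Total size" after the "bits vs. bytes" fix (density 26)
after = records_before_spill(BUDGET, 32325632, 9039)

print("before fix:", before)
print("after fix: ", after)
print("gain: %.1fx" % (after[1] / before[1]))
```

Because the batches are nearly the same size in bytes, the same number of batches fits in either case, and each batch after the fix carries about eight times as many records.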



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
