impala-issues mailing list archives

From "Tim Armstrong (JIRA)" <>
Subject [jira] [Resolved] (IMPALA-5304) Parquet scanner transfers decompression buffers when not needed
Date Mon, 22 May 2017 14:53:04 GMT


Tim Armstrong resolved IMPALA-5304.
       Resolution: Fixed
    Fix Version/s: Impala 2.9.0

IMPALA-5304: reduce transfer of Parquet decompression buffers

The buffers contain the Parquet DataPages, which need to be
attached to the row batch if the rows point to var-len data
stored directly in the page. Otherwise the buffers can be
discarded once the values in the page have been materialized.
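
The ownership rule in the paragraph above can be sketched as follows. This is only an illustration under stated assumptions: Buffer, RowBatch, PageState and FinishPage are hypothetical stand-ins, not Impala's actual classes or functions.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; none of these are Impala's real types.
using Buffer = std::vector<uint8_t>;

struct RowBatch {
  // Buffers that must outlive the batch because rows point into them.
  std::vector<std::unique_ptr<Buffer>> attached;
};

struct PageState {
  std::unique_ptr<Buffer> decompressed;  // holds the current decompressed data page
  bool rows_point_into_page = false;     // true for var-len values stored in the page
};

// Called once the reader has finished with the current data page.
// Returns true if the buffer was attached to the batch, false if it was freed.
bool FinishPage(PageState& page, RowBatch& batch) {
  assert(page.decompressed != nullptr);
  if (page.rows_point_into_page) {
    // Rows reference bytes inside the page, so the buffer travels with the batch.
    batch.attached.push_back(std::move(page.decompressed));
    return true;
  }
  // All values were copied out during materialization; free the buffer now.
  page.decompressed.reset();
  return false;
}
```

In this sketch only the first branch moves memory to whoever consumes the batch; the second branch frees the buffer where it was allocated.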

This reduces the amount of memory transferred between threads, which is
a known TCMalloc anti-pattern. It also allows us to free memory
earlier, which may help reduce memory consumption slightly.

Also fix a latent bug I noticed where needs_conversion_ is not
always initialised in the constructor.
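
One common way to rule out this class of bug in C++11 and later is a default member initializer. The sketch below assumes a hypothetical ColumnReader class; it is not the actual Impala reader, and the patch may have fixed the member differently.

```cpp
// Hypothetical reader class for illustration; not Impala's actual code.
class ColumnReader {
 public:
  // This constructor does not mention needs_conversion_ at all.
  constexpr ColumnReader() {}

  // This constructor sets it explicitly and overrides the default below.
  constexpr explicit ColumnReader(bool needs_conversion)
      : needs_conversion_(needs_conversion) {}

  constexpr bool needs_conversion() const { return needs_conversion_; }

 private:
  // Default member initializer: every constructor starts from a known value,
  // so a constructor that omits the member can no longer leave it indeterminate.
  bool needs_conversion_ = false;
};

// Checked at compile time: both construction paths yield a defined value.
static_assert(!ColumnReader().needs_conversion(), "default must be false");
static_assert(ColumnReader(true).needs_conversion(), "explicit value must win");
```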

Ran exhaustive build. Most of the Parquet tests use compressed Parquet,
which should exercise this code path.

Change-Id: I2dbd749f43078b222ff8e1ddcec840986c466de6
Reviewed-by: Tim Armstrong <>
Tested-by: Impala Public Jenkins

> Parquet scanner transfers decompression buffers when not needed
> ---------------------------------------------------------------
>                 Key: IMPALA-5304
>                 URL:
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.9.0
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>              Labels: perf, resource-management
>             Fix For: Impala 2.9.0
>
> The Parquet scanner always transfers decompression buffers to the scratch batch:
> {code}
> Status BaseScalarColumnReader::ReadDataPage() {
>   // We're about to move to the next data page.  The previous data page is
>   // now complete, pass along the memory allocated for it.
>   parent_->scratch_batch_->mem_pool()->AcquireData(decompressed_data_pool_.get(),
> {code}
> These in turn are passed along with the row batch. This is safe but unnecessary in many
> cases where the batch does not hold pointers into the decompression buffer: if the column
> has only fixed-length data, or if the data page is dictionary-encoded.
> This can make problems like IMPALA-4923 worse than they would be otherwise because extra
> data is transferred across threads.
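
The two cases named in the description can be written as a single predicate. This is a minimal sketch; PageEncoding, ColumnInfo and RowsPointIntoPage are hypothetical names chosen for illustration and do not come from the Impala source.

```cpp
// Hypothetical types for illustration; not Impala's real API.
enum class PageEncoding { kPlain, kDictionary };

struct ColumnInfo {
  bool is_var_len;  // e.g. a string column, as opposed to a fixed-length integer
};

// True only when materialized rows may hold pointers into the decompressed page:
// the column is var-len AND the page is not dictionary-encoded.
// Fixed-length columns and dictionary-encoded pages both return false.
constexpr bool RowsPointIntoPage(ColumnInfo col, PageEncoding enc) {
  return col.is_var_len && enc != PageEncoding::kDictionary;
}
```

Only when such a predicate returns true would the decompression buffer need to be passed along with the row batch.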

