impala-dev mailing list archives

From "Alex Behm (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) IMPALA-2736: Basic column-wise slot materialization in Parquet scanner.
Date Mon, 02 May 2016 20:16:54 GMT
Hello Marcel Kornacker, Skye Wanderman-Milne, Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/2779

to look at the new patch set (#10).

Change subject: IMPALA-2736: Basic column-wise slot materialization in Parquet scanner.
......................................................................

IMPALA-2736: Basic column-wise slot materialization in Parquet scanner.

This change is a first step towards a more efficient Parquet scanner.
The focus is on presenting the new code flow that materializes
the table-level slots in a column-wise fashion, without going deep
into actually improving scan efficiency.

After these changes there are several obvious places that should
be optimized to realize efficiency gains.

Summary of changes
- the table-level tuples are materialized in a column-wise fashion
  with new ColumnReader::ReadValueBatch() functions
- this is done by materializing a 'scratch' batch, and transferring
  scratch tuples that survive filters/conjuncts to the output batch
- the tuples of nested collections are still materialized in
  a row-wise fashion using the ColumnReader::ReadValue() function,
  just as before
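
The scratch-batch flow described above can be sketched roughly as follows.
This is a simplified, standalone illustration and not the code in
hdfs-parquet-scanner.cc: the Tuple layout, the ColumnReader template, the
ScanBatch() helper and the std::function conjunct are all hypothetical
stand-ins, and only the ReadValueBatch() name is taken from the change.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical two-slot tuple; not Impala's actual Tuple layout.
struct Tuple {
  int32_t linenumber;
  double quantity;
};

// Hypothetical stand-in for a column reader. The vector plays the role of
// the already-decoded column values.
template <typename T>
class ColumnReader {
 public:
  ColumnReader(std::vector<T> values, T Tuple::* slot)
      : values_(std::move(values)), slot_(slot) {}

  // Column-wise read: writes up to 'max_values' values into the same slot of
  // consecutive scratch tuples. Returns the number of values written.
  size_t ReadValueBatch(Tuple* scratch, size_t max_values) {
    size_t n = 0;
    while (n < max_values && pos_ < values_.size()) {
      scratch[n].*slot_ = values_[pos_];
      ++n;
      ++pos_;
    }
    return n;
  }

 private:
  std::vector<T> values_;
  size_t pos_ = 0;
  T Tuple::* slot_;  // the tuple member this column fills
};

// Materializes one scratch batch column by column, then transfers the scratch
// tuples that pass 'conjunct' to 'output'. Returns the number of scratch
// tuples that were materialized (0 once the columns are exhausted).
size_t ScanBatch(ColumnReader<int32_t>& linenumber_col,
                 ColumnReader<double>& quantity_col, size_t batch_size,
                 const std::function<bool(const Tuple&)>& conjunct,
                 std::vector<Tuple>* output) {
  std::vector<Tuple> scratch(batch_size);
  // Assumes both columns hold the same number of values.
  size_t n = linenumber_col.ReadValueBatch(scratch.data(), batch_size);
  quantity_col.ReadValueBatch(scratch.data(), n);
  for (size_t i = 0; i < n; ++i) {
    if (conjunct(scratch[i])) output->push_back(scratch[i]);
  }
  return n;
}
```

A caller would invoke ScanBatch() repeatedly until it returns 0. In this
sketch each column is decoded in its own tight loop before any conjunct is
evaluated, which is the column-wise part; the row-wise path would instead
read one value per column for each tuple.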

Mini benchmark
I ran the following queries on a single impalad before and after my
change using a synthetic 'huge_lineitem' table.
To focus the measurement on the scan time, I modified hdfs-scan-node.cc
to set the number of rows of every row batch to 0.

Query options:
set num_scanner_threads=1;
set disable_codegen=true;
set num_nodes=1;

select * from huge_lineitem;
Before: 22.39s
After:  18.50s

select * from huge_lineitem where l_linenumber < 0;
Before: 25.11s
After:  20.56s

select * from huge_lineitem where l_linenumber % 2 = 0;
Before: 26.32s
After:  21.82s
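
For reference, the relative reduction in elapsed time implied by the
numbers above is (before - after) / before, which works out to roughly
17-18% for each of the three queries. A small sketch of that arithmetic
follows; the helper and struct names are illustrative only and are not
part of the change.

```cpp
#include <cassert>
#include <cstdio>

// Fractional reduction in elapsed time: (before - after) / before.
double RelativeReduction(double before_s, double after_s) {
  return (before_s - after_s) / before_s;
}

// Timings copied from the mini benchmark above, in seconds.
struct Timing {
  const char* query;
  double before_s;
  double after_s;
};

const Timing kTimings[] = {
    {"select *", 22.39, 18.50},
    {"l_linenumber < 0", 25.11, 20.56},
    {"l_linenumber % 2 = 0", 26.32, 21.82},
};

// Prints one line per query with the reduction as a percentage.
void PrintReductions() {
  for (const Timing& t : kTimings) {
    std::printf("%-22s %.1f%%\n", t.query,
                100.0 * RelativeReduction(t.before_s, t.after_s));
  }
}
```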

Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/runtime/runtime-state.cc
M be/src/util/rle-encoding.h
A testdata/workloads/functional-query/queries/QueryTest/parquet-abort-on-error.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test
M tests/common/test_result_verifier.py
M tests/query_test/test_scanners.py
8 files changed, 504 insertions(+), 177 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/79/2779/10
-- 
To view, visit http://gerrit.cloudera.org:8080/2779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Gerrit-PatchSet: 10
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <skye@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
