impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tianyi Wang (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately
Date Thu, 31 Aug 2017 21:30:44 GMT
Hello Lars Volker, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7776

to look at the new patch set (#8).

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................

IMPALA-5210: Count rows and collection items in parquet scanner separately

This patch adds collection_items_read_counter in scan node, makes
rows_read_counter count top-level rows only, and updates these counters
in a less frequent manner.
When scanning nested columns, current code counts both top-level rows
and nested rows in rows_read_counter, which is inconsistent with
rows_returned_counter. Furthermore, rows_read_counter is updated eagerly
whenever a batch of collection items are read. As a result it spends
around 10% time updating the counter with the following simple query:
>select count(*) from
> customer c,
> c.c_orders o,
> o.o_lineitems l
>where
> c_mktsegment = 'BUILDING'
> and o_orderdate < '1995-03-15'
> and l_shipdate > '1995-03-15' and o_orderkey = 10;

This patch moves collection items counting into
collection_items_read_counter. Both counters are updated for every row
batch read. In the query described above, scanning time is decreased by
10.4%.

Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/scan-node.cc
M be/src/exec/scan-node.h
4 files changed, 33 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/7776/8
-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <twang@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhecht@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <twang@cloudera.com>

Mime
View raw message