impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tianyi Wang (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately
Date Wed, 23 Aug 2017 04:43:51 GMT
Tianyi Wang has uploaded a new patch set (#2).

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................

IMPALA-5210: Count rows and collection items in parquet scanner separately

This patch addes collection_items_read_counter in scan node, makes
rows_read_counter count top-level rows only, and updates these counters
in a less frequent manner.
When scanning nested columns, current code counts both top-level rows
and nested rows in rows_read_counter, which is inconsistent with
rows_returned_counter. Furthermore, rows_read_counter is updated eagerly
whenever a batch of collection items are read. As a result it spends
around 10% time updating the counter with the following simple query:
> select count(*) from
> customer c,
> c.c_orders o,
> o.o_lineitems l
>where
> c_mktsegment = 'BUILDING'
> and o_orderdate < '1995-03-15'
> and l_shipdate > '1995-03-15' and o_orderkey = 10;

This patch moves collection items couting into
collection_items_read_counter. Both counters are updated every row batch
read. In the query described above, scanning time is decreased by 10.4%.

Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/exec/scan-node.cc
M be/src/exec/scan-node.h
6 files changed, 77 insertions(+), 48 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/7776/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <twang@cloudera.com>

Mime
View raw message