impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Date Fri, 20 Oct 2017 00:48:04 GMT
Hello Thomas Tauber-Marshall, 

I'd like you to reexamine a change. Please visit

to look at the new patch set (#10).

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro

IMPALA-5307: Part 2: copy out strings in uncompressed Avro

The approach is to re-materialize strings in those tuples that
survive conjunct evaluation and may reference disk I/O buffers
directly. This means that perf should not regress for the
following cases:
* Compressed Avro files.
* Non-string columns.
* Selective scans where the majority of tuples are filtered out.

This approach will also work for the Sequence and Text scanners.

Includes some improvements to Avro codegen to replace more constants to
help win back some performance (with limited success): replaced
InitTuple() with an optimised version and substituted
tuple_byte_size() with a constant.

Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed

Did microbenchmarks on uncompressed Avro files, one with all columns
from lineitem and one with only l_comment. Tests were run with:
  set num_scanner_threads=1;

I ran the query 5 times and extracted MaterializeTupleTime from the
profile to measure CPU cost of materialization. Overall string
materialization got significantly slower, mainly because of the
extra memcpy() calls required.

Selecting one string from a table with multiple columns:

  select min(l_comment) from biglineitem_avro
  1.814 -> 2.096

Selecting one string from a table with one column:

  select min(l_comment) from biglineitem_comment; profile;
  1.708 -> 3.7

Selecting one string from a table with one column with predicate:

  select min(l_comment) from biglineitem_comment where length(l_comment) > 10000;
  1.691 -> 1.449

Selecting all columns:

  select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber),
        min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax),
        min(l_returnflag), min(l_linestatus), min(l_shipdate),
        min(l_commitdate), min(l_receiptdate), min(l_shipinstruct),
        min(l_shipmode), min(l_comment) from biglineitem_avro; profile;
  2.335 -> 3.711

Selecting an int column (no strings):
  select min(l_linenumber) from biglineitem_avro
  1.806 -> 1.819

Ran exhaustive tests.

Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
M be/src/codegen/
M be/src/codegen/
M be/src/codegen/
M be/src/codegen/llvm-codegen.h
M be/src/common/
M be/src/common/status.h
M be/src/exec/
M be/src/exec/
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/
M be/src/exec/hdfs-scanner.h
M be/src/runtime/CMakeLists.txt
M be/src/runtime/
M be/src/runtime/descriptors.h
M be/src/runtime/
M be/src/runtime/runtime-state.h
A be/src/runtime/
M be/src/runtime/
M be/src/runtime/tuple.h
19 files changed, 331 insertions(+), 56 deletions(-)

  git pull ssh:// refs/changes/46/8146/10
To view, visit
To unsubscribe, visit

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 10
Gerrit-Owner: Tim Armstrong <>
Gerrit-Reviewer: Thomas Tauber-Marshall <>
Gerrit-Reviewer: Tim Armstrong <>

  • Unnamed multipart/alternative (inline, 8-Bit, 0 bytes)
View raw message