impala-dev mailing list archives

From "Tim Armstrong (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) IMPALA-3780: avoid many small reads past end of block
Date Thu, 30 Jun 2016 06:37:15 GMT
Hello Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/3518

to look at the new patch set (#5).

Change subject: IMPALA-3780: avoid many small reads past end of block
......................................................................

IMPALA-3780: avoid many small reads past end of block

The text scanner had some pathological behaviour when reading
significantly past the end of its scan range, e.g. when reading a 256mb
string that's split across blocks. ScannerContext defaulted to issuing
1kb reads, even if the scan node requested significantly more data. E.g.
if the Parquet scanner called ReadBytes(16mb), this was chopped up into
1kb reads, which were reassembled in boundary_buffer_.

Increase the minimum read size in this case to 64kb. Reading that amount
of data should not have any significant overhead even if we only read
a few bytes past the end of the scan range.
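
For a rough sense of scale, the number of reads needed at each
minimum read size can be worked out as below. This is illustrative
arithmetic only, not code from the patch: NumReads() is a hypothetical
helper, and "kb"/"mb" are assumed to mean 1024-based units.

```cpp
#include <cstdint>

constexpr int64_t kKiB = 1024;
constexpr int64_t kMiB = 1024 * kKiB;

// Hypothetical helper (not part of Impala): number of fixed-size reads
// needed to cover 'total_bytes' when each read returns 'read_size' bytes.
constexpr int64_t NumReads(int64_t total_bytes, int64_t read_size) {
  return (total_bytes + read_size - 1) / read_size;  // ceiling division
}

// A 16mb request served by 1kb reads vs. 64kb reads.
constexpr int64_t kReadsAt1Kb = NumReads(16 * kMiB, 1 * kKiB);    // 16384
constexpr int64_t kReadsAt64Kb = NumReads(16 * kMiB, 64 * kKiB);  // 256
```

Even the 64kb minimum alone cuts the read count for this case by a
factor of 64.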

ScannerContext now implements a saner default algorithm that will work
better if scanners make many small reads: it starts with 64kb reads and
doubles the size of each successive read past the end of the scan range.
We also correctly pass the 'read_past_size' into GetNextBuffer(), so that
we always read the right amount of data.
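
The doubling policy described above might look roughly like the
following sketch. It is not the code in scanner-context.cc:
NextReadPastSize() and ReadSchedule() are hypothetical names, and
taking the larger of the doubled size and the caller's request is an
assumption about how the two are combined.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Assumed minimum size of a read past the end of the scan range (64kb).
constexpr int64_t kMinReadPastSize = 64 * 1024;

// Hypothetical: size of the next read past the end of the scan range.
// 'prev_read_size' is 0 for the first such read. The result is never
// smaller than what the caller explicitly asked for.
int64_t NextReadPastSize(int64_t prev_read_size, int64_t requested_bytes) {
  int64_t doubled =
      prev_read_size == 0 ? kMinReadPastSize : prev_read_size * 2;
  return std::max(doubled, requested_bytes);
}

// Hypothetical: sizes of the successive reads issued until at least
// 'total_bytes' past the end of the scan range have been covered, when
// the caller only ever asks for 'requested_bytes' at a time.
std::vector<int64_t> ReadSchedule(int64_t total_bytes,
                                  int64_t requested_bytes) {
  std::vector<int64_t> sizes;
  int64_t covered = 0;
  int64_t prev = 0;
  while (covered < total_bytes) {
    prev = NextReadPastSize(prev, requested_bytes);
    sizes.push_back(prev);
    covered += prev;
  }
  return sizes;
}
```

Under these assumptions, covering a 256mb string with byte-at-a-time
requests takes 13 reads, while a single ReadBytes(16mb) call is served
by one 16mb read.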

Also save some time by pre-sizing the boundary buffer to the correct
size instead of reallocating it multiple times.
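
The pre-sizing idea can be illustrated with std::string standing in
for the boundary buffer. This is only a sketch: Assemble() and the
Assembled struct are hypothetical, and Impala's StringBuffer has its
own interface that is not shown here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: the assembled bytes plus how many times
// the buffer's capacity changed while chunks were being appended.
struct Assembled {
  std::string data;
  int capacity_changes;
};

// Hypothetical: concatenates 'chunks' into one buffer. If 'presize' is
// true, the buffer is reserved to the final size up front, so appending
// never has to grow it.
Assembled Assemble(const std::vector<std::string>& chunks, bool presize) {
  Assembled result{std::string(), 0};
  if (presize) {
    std::size_t total = 0;
    for (const std::string& c : chunks) total += c.size();
    result.data.reserve(total);
  }
  std::size_t last_capacity = result.data.capacity();
  for (const std::string& c : chunks) {
    result.data.append(c);
    if (result.data.capacity() != last_capacity) {
      ++result.capacity_changes;  // the buffer had to be regrown
      last_capacity = result.data.capacity();
    }
  }
  return result;
}
```

With presize set, the capacity never changes during the appends; without
it, the number of regrowths depends on the library's growth policy.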

Testing:
Add test case that exercises the code paths for very large strings.

Performance:
The queries in the test case are vastly faster than before. E.g. 0.6s
vs ~60s for the count(*) query.

Change-Id: Id90c5dea44f07dba5dd465cf325fbff28be34137
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/string-buffer.h
M tests/query_test/test_insert.py
6 files changed, 90 insertions(+), 43 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/18/3518/5
-- 
To view, visit http://gerrit.cloudera.org:8080/3518
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id90c5dea44f07dba5dd465cf325fbff28be34137
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhecht@cloudera.com>
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
