impala-dev mailing list archives

From "Henry Robinson (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) IMPALA-3804: Push per-split filtering into scanners
Date Fri, 01 Jul 2016 22:23:12 GMT
Henry Robinson has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/3561

Change subject: IMPALA-3804: Push per-split filtering into scanners
......................................................................

IMPALA-3804: Push per-split filtering into scanners

IMPALA-3798 was a bug that occurred when a header split was filtered
out without correctly cancelling all the scan ranges in the rest of the
file. To fix this properly, we have to make the scanners aware of
per-split filtering, since different scanners need to compensate for a
filtered scan in different ways.

For example, sequence-based scanners (such as Avro) only issue most of a
file's ranges after the header range has been scanned. Therefore if a
header split is filtered out, all the remaining ranges can be safely
marked as complete. If a non-header split is filtered, it may not be
safe to mark any other split as complete, since it may be concurrently
scanned by a different scanner.

The text scanner issues all ranges at once, so it is only safe to mark
the current range as complete. The Parquet scanner does something
different: it processes all splits for one file on the same thread, and
so marks all those splits as 'complete' very early on.

This patch adds HdfsScanner::FilterScanRange() which should be called by
ProcessSplit(). FilterScanRange() will return true if the scan range
should not be scanned, and accepts a policy parameter that describes
what compensation action to perform (close all scan ranges, only the
current one, or none).
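
As a rough illustration of the policy parameter described above, here is a
minimal standalone sketch. It is not the code from this patch: the enum
name FilterPolicy, its values, the ScanRange and FileRanges structs, and
the free-function signature are all hypothetical stand-ins, and the
passes_filter argument stands in for evaluating the runtime filters
against the split.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names throughout; the enum and signatures in the actual
// patch may differ.
enum class FilterPolicy {
  kCloseAllRanges,     // e.g. a sequence scanner whose header split was filtered
  kCloseCurrentRange,  // e.g. the text scanner, which issues all ranges at once
  kCloseNone           // e.g. Parquet, which marks its splits complete early on
};

// Simplified stand-in for a scan range and its completion state.
struct ScanRange {
  std::string file;
  bool complete = false;
};

// Simplified stand-in for the scan node's bookkeeping of one file's ranges.
struct FileRanges {
  std::vector<ScanRange> ranges;

  int NumComplete() const {
    int n = 0;
    for (const ScanRange& r : ranges) {
      if (r.complete) ++n;
    }
    return n;
  }
};

// Returns true if the range at index `current` should not be scanned.
// When the split is filtered out, the policy decides which ranges are
// marked as complete to compensate.
bool FilterScanRange(FileRanges& file, std::size_t current, bool passes_filter,
                     FilterPolicy policy) {
  if (passes_filter) return false;  // Not filtered: the caller scans as usual.
  switch (policy) {
    case FilterPolicy::kCloseAllRanges:
      for (ScanRange& r : file.ranges) r.complete = true;
      break;
    case FilterPolicy::kCloseCurrentRange:
      file.ranges[current].complete = true;
      break;
    case FilterPolicy::kCloseNone:
      break;  // Caller has already accounted for these ranges.
  }
  return true;
}
```

In this sketch a scanner's ProcessSplit() would call FilterScanRange()
first and return early when it yields true, with each scanner type
passing the policy that matches how it issues its ranges.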

Testing:

* Added logic to test_sequence_file_filtering_race to check that
  per-scan filtering was happening correctly, confirming that the
  rewritten path was taking effect. Expanded the test to hit all scanner
  types.

* Manually ran the existing runtime filters test suite with file
  filtering disabled, after rewriting the tests to expect split
  filtering instead of file filtering. Tests passed.

Change-Id: I9f92178f642695e0e9ef901373a5e9f2878a78ce
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M be/src/exec/scanner-context.h
M tests/custom_cluster/test_seq_file_filtering.py
10 files changed, 96 insertions(+), 82 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/61/3561/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3561
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I9f92178f642695e0e9ef901373a5e9f2878a78ce
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Henry Robinson <henry@cloudera.com>
