impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu
Date Wed, 01 Nov 2017 23:33:47 GMT
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
......................................................................


Patch Set 9:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@13
PS9, Line 13: In RuntimeFilterGenerator in the planner, each partitioned hash join
... each hash join node generates a bloom and min-max filter for each equi-join predicate,
but only those ...


http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@27
PS9, Line 27: For now, min-max filters are only applied at the KuduScanner, which
Not specific to the code changes, and I don't expect a response here (probably too long :)).

How do the existing query options around runtime filters affect the new min/max filters on
Kudu? For example, what does DISABLE_ROW_RUNTIME_FILTERING mean for the Kudu min/max filters?

How should users think about setting:
RUNTIME_FILTER_WAIT_TIME_MS

In particular, are min/max filters more effective against Kudu PK or partition columns?


http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@41
PS9, Line 41: Perf Testing:
Contrived extreme queries are good data points, but how about running the TPCH/DS perf suites
against Kudu?


http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@44
PS9, Line 44: - Ran a contrived query with a filter that does eliminate any rows
does not eliminate


http://gerrit.cloudera.org:8080/#/c/7793/9/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/7793/9/common/thrift/ImpalaInternalService.thrift@202
PS9, Line 202:   // Maximum number of bloom runtime filters allowed per query
I think I understand why you did this, but it seems confusing from a user's perspective. Ok
to leave, but do you have a story around the eventual meaning of existing query options when
HDFS can do min/max and Kudu can do bloom?


http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/PlanNode.java
File fe/src/main/java/org/apache/impala/planner/PlanNode.java:

http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/PlanNode.java@730
PS9, Line 730:     output.append(Joiner.on(", ").join(filtersStr) + "\n");
just return the string? don't think we need the 'output' StringBuilder


http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@111
PS9, Line 111:     private final Operator joinOp_;
Let's call this cmpOp_ or exprCmpOp_ or something else because "joinOp_" usually indicates
the join type like left outer, right outer, etc.


http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@170
PS9, Line 170:           SlotRef slotRef = expr.unwrapSlotRef(false);
Add a comment stating that the validity of this is checked elsewhere (and where exactly)


http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@368
PS9, Line 368:       if (node instanceof HdfsScanNode && type_ != TRuntimeFilterType.BLOOM)
{
I feel like these checks belong in the caller. Having an addTarget() function be a co-op in
some cases seems difficult to reason about. Can be cleaned with a isValidTarget() helper function.


http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@463
PS9, Line 463:     // We only enforce a limit on the number of bloom filters as they are much
more
This seems really confusing for users. I'm ok with checking in this version, and let's discuss
separately.


http://gerrit.cloudera.org:8080/#/c/7793/9/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test:

http://gerrit.cloudera.org:8080/#/c/7793/9/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test@1248
PS9, Line 1248: |  runtime filters: RF004 <- o_orderkey, RF005 <- o_clerk
To me the re-numbering is a little strange. We can think about how to address this in a follow-on
change. I'm thinking that ideally users should be able to quickly determine the number of
runtime filters based on the max RF id, so we could assign the real RF id lazily instead of
eagerly.


http://gerrit.cloudera.org:8080/#/c/7793/9/testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test
File testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test:

http://gerrit.cloudera.org:8080/#/c/7793/9/testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test@66
PS9, Line 66:     and cast(a.int_col as smallint) = b.smallint_col
Does Kudu evaluate this cast? If not, I think the results could be "wrong" or at least different
from the results without a runtime filter.

For example, if there were some legitimate negative values in b.smallint_col, then those could
match with the a.int_col values that are bigger than max smallint because our cast() might
returns negative values.

In any case, this case seems tricky enough to need a separate explanation in the runtime filter
generator.


http://gerrit.cloudera.org:8080/#/c/7793/9/testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test@92
PS9, Line 92:    runtime filters: RF005 -> CAST(a.int_col AS SMALLINT) (min_max), RF007
-> a.tinyint_col (min_max)
How do you feel about adding the (bloom) or (min/max) annotation right after the RF ID? To
me that version looks slightly better, but it's subjective. Example:

RF005(min/max) -> CAST(a.int_col AS SMALLINT)
RF007(bloom) -> a.tinyint_col

alternative with brackets

RF005[min/max] -> CAST(a.int_col AS SMALLINT)
RF007[bloom] -> a.tinyint_col


http://gerrit.cloudera.org:8080/#/c/7793/9/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/7793/9/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@2
PS9, Line 2: ---- QUERY
Does column properties like PK or partition column matter to Kudu? We might want to test all
those variants to be sure.



-- 
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 9
Gerrit-Owner: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mjacobs@apache.org>
Gerrit-Reviewer: Michael Ho <kwho@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <todd@apache.org>
Gerrit-Comment-Date: Wed, 01 Nov 2017 23:33:47 +0000
Gerrit-HasComments: Yes

Mime
  • Unnamed multipart/alternative (inline, 8-Bit, 0 bytes)
View raw message