impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thomas Tauber-Marshall (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu
Date Fri, 03 Nov 2017 16:07:40 GMT
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/7793
)

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
......................................................................


Patch Set 11:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@13
PS9, Line 13: In RuntimeFilterGenerator in the planner, each hash join node
> ... each hash join node generates a bloom and min-max filter for each equi-
Done


http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@27
PS9, Line 27: 
> Not specific to the code changes, and I don't expect a response here (proba
RUNTIME_FILTER_WAIT_TIME_MS applies the same to bloom and min-max - its the time a scan node
will wait for any filter of either type to arrive. Similarly, RUNTIME_FILTER_MODE works the
same for min-max as its worked for bloom.

The size related params - RUNTIME_BLOOM_FILTER_SIZE, RUNTIME_FILTER_MIN/MAX_SIZE only apply
to bloom filters as the mem used by min-max is small and bounded.

I've updated some comments to make the above clearer.

One tricky case is DISABLE_ROW_RUNTIME_FILTERING. Kudu applies the filters on both a per-partition
and per-row basis. At the moment, there's no way to change this - the filters are applied
through the same interface as eg. predicates that are pushed down, so Kudu assumes that they
have to be applied for correctness, not just for performance.

So, I could see the argument either way - we could disable Kudu filters if DISABLE_ROW_RUNTIME_FILTERING
is true, though it would also disable the partition filtering, or we could just not consider
DISABLE_ROW_RUNTIME_FILTERING for Kudu filters.

The remaining filter related option is MAX_NUM_RUNTIME_FILTERS, which I've addressed in another
comment.


http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@41
PS9, Line 41: 
> Contrived extreme queries are good data points, but how about running the T
Those results are posted in the review comments. I can include them here as well, I just felt
it made the commit message excessively large.


http://gerrit.cloudera.org:8080/#/c/7793/9//COMMIT_MSG@44
PS9, Line 44:   timings are averages of 3 runs.
> does not eliminate
Done


http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/runtime/timestamp-value.h@164
PS8, Line 164: tvalue.ti
> I'm confused by this variable name.
Done


http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter-test.cc
File be/src/util/min-max-filter-test.cc:

http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter-test.cc@27
PS8, Line 27: 
> Can you maybe briefly describe what is tested for each filter type? There s
Done


http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter-test.cc@252
PS8, Line 252:       MinMaxFilter::Create(tFilter, string_type, &obj_pool, &mem_pool);
> Would TestEnv work? That is the semi-standard way to create an ExecEnv for 
Done


http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter-test.cc@353
PS8, Line 353:   t3.ToTColumnValue(&tFilter2.max);
> Remove?
Done


http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/7793/8/be/src/util/min-max-filter.cc@246
PS8, Line 246: null
> nullptr?
Done


http://gerrit.cloudera.org:8080/#/c/7793/9/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/7793/9/common/thrift/ImpalaInternalService.thrift@202
PS9, Line 202:   // Maximum number of bloom runtime filters allowed per query
> I think I understand why you did this, but it seems confusing from a user's
There's basically two reasons for this:
- We don't want to regress queries by eliminating bloom filters that used to be generated.
- The purpose of this param in the first place was to prevent the coordinator from being overwhelmed.
Min-max filters are less heavy-weight than bloom filters so they don't put as much stress
on the coordinator anyways.

Given that, I think that it would be fine to maintain this behavior even once HDFS can do
min-max and Kudu can do bloom.

Another concern though is what to do if we add in-list filters, which are probably similarly
heavy-weight to bloom filters.

One option would be to deprecate this and have separate max_num_bloom and max_num_in_list
params, but that may be confusing and requires user action to keep things working as expected
anyways.

Another option at that point would be to keep this param and make it "Maximum number of expensive
filters (bloom or in-list)", and then there's just once we have to worry about regressing
queries.

Or maybe we should just apply this to min-max filters as well, since its the least confusing
option in the long run, possibly bumping up the default a little.

None of the options are ideal.


http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/PlanNode.java
File fe/src/main/java/org/apache/impala/planner/PlanNode.java:

http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/PlanNode.java@730
PS9, Line 730:   }
> just return the string? don't think we need the 'output' StringBuilder
Done


http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@111
PS9, Line 111:     // Expr (lhs of join predicate) from which the targetExprs_ are generated.
> Let's call this cmpOp_ or exprCmpOp_ or something else because "joinOp_" us
Done


http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@170
PS9, Line 170:         List<Integer> tSlotIds = Lists.newArrayListWithCapacity(sids.size());
> Add a comment stating that the validity of this is checked elsewhere (and w
Done


http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@368
PS9, Line 368:           || src_.getChild(0).getCardinality() == -1
> I feel like these checks belong in the caller. Having an addTarget() functi
Done


http://gerrit.cloudera.org:8080/#/c/7793/9/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@463
PS9, Line 463:           return a.getFilterId().compareTo(b.getFilterId());
> This seems really confusing for users. I'm ok with checking in this version
Done


http://gerrit.cloudera.org:8080/#/c/7793/9/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test:

http://gerrit.cloudera.org:8080/#/c/7793/9/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test@1248
PS9, Line 1248: |  runtime filters: RF004 <- o_orderkey, RF005 <- o_clerk
> To me the re-numbering is a little strange. We can think about how to addre
I filed: IMPALA-6145


http://gerrit.cloudera.org:8080/#/c/7793/9/testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test
File testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test:

http://gerrit.cloudera.org:8080/#/c/7793/9/testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test@66
PS9, Line 66:     and cast(a.int_col as smallint) = b.smallint_col
> Does Kudu evaluate this cast? If not, I think the results could be "wrong" 
You're correct, it doesn't work. In fact, looking closer there's no type of cast on the target
expr that we can correctly support.


http://gerrit.cloudera.org:8080/#/c/7793/9/testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test@92
PS9, Line 92:    tuple-ids=0 row-size=26B cardinality=7300
> How do you feel about adding the (bloom) or (min/max) annotation right afte
Done


http://gerrit.cloudera.org:8080/#/c/7793/9/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/7793/9/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@2
PS9, Line 2: ---- QUERY
> Does column properties like PK or partition column matter to Kudu? We might
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 11
Gerrit-Owner: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Anonymous Coward #345
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mjacobs@apache.org>
Gerrit-Reviewer: Michael Ho <kwho@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <todd@apache.org>
Gerrit-Comment-Date: Fri, 03 Nov 2017 16:07:40 +0000
Gerrit-HasComments: Yes

Mime
  • Unnamed multipart/alternative (inline, 8-Bit, 0 bytes)
View raw message