impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-2581: LIMIT can be propagated down into some aggregations
Date Tue, 02 Aug 2016 00:00:15 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-2581: LIMIT can be propagated down into some aggregations
......................................................................


Patch Set 1:

(4 comments)

I thought some more about this during the day and I think there are some subtle  issues around
spilling and passthrough aggs with limits that we don't have test coverage for.

http://gerrit.cloudera.org:8080/#/c/3822/1/be/src/exec/partitioned-aggregation-node-ir.cc
File be/src/exec/partitioned-aggregation-node-ir.cc:

Line 130:   if (group_count_ < max_groups_) {
> We can turn this into a one-liner without an else.
This will also give incorrect results for spilling aggregations. If AGGREGATED_ROWS is true,
this is processing a spilled row that was already aggregated, so will double-count that group.

I think we should add a test to test_spilling.py that exercises this code path (e.g. a large
agg with a high limit).


Line 250:     if (group_count_ >= max_groups_) {
I think this should should actually occur before the remaining_capacity check. Otherwise we'll
pass through rows even when we already have enough groups (meaning the limit isn't correctly
enforced and we do unnecessary work).


http://gerrit.cloudera.org:8080/#/c/3822/1/be/src/exec/partitioned-aggregation-node.cc
File be/src/exec/partitioned-aggregation-node.cc:

PS1, Line 506: ReachedLimit
Do we actually need these limit checks for grouping aggregations now? If not, we should remove
the redundant checks and document how the limit is enforced.


Line 519: Status PartitionedAggregationNode::GetRowsStreaming(RuntimeState* state,
We don't actually implement limit checks in GetRowsStreaming(), since previously limits were
never applied.

I think in some circumstances (e.g. if it switches into streaming mode) the current version
of the patch will return more rows than the limit. I think if you fix my other comment in
-ir.cc this may not be the case, but it would be good to think through and document why the
explicit limit check isn't necessary (if it indeed isn't).


-- 
To view, visit http://gerrit.cloudera.org:8080/3822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I59c5b7af7a73ccdbc5496b28eacb9b6859d202bc
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple <jbapple@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message