impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-5483: Automatically disable codegen for small queries
Date Thu, 22 Jun 2017 22:58:56 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5483: Automatically disable codegen for small queries

Patch Set 3:

Commit Message:

PS3, Line 15: a plan node
> Unless we are sure this is a common user scenario, I lean towards a default
If users don't have any queries small enough for this to kick in then there's no chance of
them regressing. All the small queries I've looked at show a nice benefit from this so the
two like scenarios in my mind are:
* all queries are too large and it has no effect
* there are small queries and it noticably improves the response time of the majority of them
(maybe regressing a small fraction to some degree)

I suspect that these kind of queries are pretty common if people are doing exploratory work
with hand-written queries or queries generated by tools (one of the motivations for this was
that a BI tool was generating queries with joins with complex expressions over moderate amounts
of data). I've seen queries that would benefit from this on other clusters as well.

I'd be reluctant to merge this if it's off by default since realistically most users won't
know to turn it on. I think shipping optimisations off by default either means our defaults
are bad or we don't have enough confidence in the optimisation to justify the work. I'm open
to adjusting the threshold to a lower number.
File be/src/service/

Line 497:         query_options->__set_disable_codegen_rows_threshold(atoi(value.c_str()));
> Consider using StringParser::StringToInt() like in other places here, so we
File common/thrift/ImpalaInternalService.thrift:

Line 255:   // If the number of rows per node is below the threshold codegen will be automatically
> If the number of rows processed per node ...
File common/thrift/ImpalaService.thrift:

Line 279:   // If the number of rows per node is below the threshold and disable_codegen is
> If the number of rows processed per node
File fe/src/main/java/org/apache/impala/planner/

Line 151: 
> Needs PlannerTests. You can do something like PlannerTest.testComputeStatsM
Happy to do this but will defer the work until we have a clear decision about whether this
is worth doing.

Line 152:     checkForDisableCodegen(rootFragment.getPlanRoot());
> Add comment why we have to do this on the fragmented plan

Line 477:     boolean isSmallQuery = visitor.valid() && visitor.getMaxRowsProcessed()
< threshold;
> remove visitor.valid() and inline result into the if

Line 492:   private void checkForDisableCodegen(PlanNode planRoot) {
> planRoot -> distributedPlan

Line 493:     // Determine the maximum number of rows processed by a instance of a plan node.
> This comments reads like the comment on MaxRowsProcessedVisitor.maxRowsProc
Reworked this comment to explain the motivation for using the number of rows per node as a
File fe/src/main/java/org/apache/impala/util/

Line 32:   // True if we should abort because if we don't have valid estimates
> remove second "if"

Line 47:     if (caller instanceof HashJoinNode || caller instanceof NestedLoopJoinNode) {
> if (caller instanceof JoinNode) foundJoinNode_ = true;

Line 60:           || (missingStats && !scan.hasLimit() && scan.getConjuncts().isEmpty()))
> Why would the existence of scan conjuncts make our estimate valid despite m
In that case the scan returns every row it scans and hits the limit after processing a bounded
amount of data.

Line 66:           Math.max(maxRowsProcessedPerNode_, numRows / numNodes);
> Nit: Do the division as double and take the ceil, otherwise it seems like w

Line 77:           Math.max(maxRowsProcessedPerNode_, numRows / numNodes);
> Nit: Do the division as double and take the ceil, otherwise it seems like w

Line 81:   public boolean valid() {
> single line

Line 96:     return foundJoinNode_;
> Preconditions.checkState(valid_); seems appropriate here as well
File testdata/workloads/functional-query/queries/QueryTest/codegen.test:

Line 2: ---- QUERY
> Any reason not to make these PlannerTests?
To confirm that it's actually plumbed all the way through to the backend.
File tests/common/

Line 139:                                       disable_codegen_rows_threshold_options=[5000],
> Why not 0? For the same reason you use 0 below.
The reasoning below doesn't apply since disable_codegen is always on here. 

My thinking was that tests that were intended to cover paths that could be codegened or not
should not be using create_single_exec_option_dimension() in the first place (since they'd
be missing coverage of the non-codegened paths).

If we want to preserve the same coverage of codegen that we had previously I could make it
0. I thought on the balance here it made more sense to enable the toggling behaviour so that
the behaviour in tests diverged less from the default behaviour of Impala.

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I273bcee58641f5b97de52c0b2caab043c914b32e
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Juan Yu <>
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-HasComments: Yes

View raw message