impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joe McDonnell (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
Date Fri, 27 Jan 2017 23:28:42 GMT
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 2:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/5768/2/fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
File fe/src/main/java/org/apache/impala/analysis/CaseExpr.java:

Line 384:     // Otherwise, take a max over all the outputs as well as all the slotrefs that
> Why would we consider the slotrefs of the input? They don't affect the NDV 
The existing algorithm looks at all SlotRef descendants of the Expr. For the CaseExpr, that
includes the when expressions. My thought for the code change was to improve cases where we
know we can do better and fall back to the existing behavior otherwise.

If we know the number of distinct values for all the outputs, then we ignore the inputs completely.

Otherwise, we have at least one output with no statistics. I think it is debatable what to
do in this case. Here is an example SQL:

select distinct case when a.id > 10000 then b.number else a.number end
from a, b;

If we have statistics on table a but not table b, what do we do?

Option 1: Return that we don't know the number of distinct values.
Option 2: Return the max number of distinct values over the then+else clauses. In this case,
that would mean returning the number of distinct values for a.number.
Option 3: Return the max over all SlotRefs (input and output). This would be the max distinct
values over a.number and a.id.

Suppose a.number has 500 distinct values but a.id has 7300 distinct values. 7300 distinct
values implies that there could be 7300 rows coming in (but even this is inexact, as where
clauses can completely change this). Are we better off returning 500 or 7300? Neither is particularly
good. There might be 7300 distinct values for a.id and 1,500,000 for b.number. There might
be 7300 distinct values for a.id and 300 for b.number. 

It is more a question of what to do with incomplete information. Option 3 would err on the
side of guessing too many distinct values, but it can still be completely wrong.


Line 388:     boolean allOutputsKnown = true;
> Why not put this new code into an override of computeNumDistinctValues()?
I will look into this. The way it works today is that Expr's analyze calls computeNumDistinctValues
and subclasses override that by setting numDistinctValues_. This is done in LiteralExpr, Predicate,
and SlotRef.

I think this might be due to the order of execution. SlotRef calls super.analyze first, but
then it still needs to do more setup before initializing numDistinctValues_. CaseExpr casts
expressions after super.analyze. I don't think that would impact the correctness of computeNumDistinctValues.
LiteralExpr and Predicate should be fine overriding computeNumDistinctValues.


Line 390:     long    constOutputVals = 0;
> formatting: long constOutputVals (just one space after type)
Done


Line 392:     long    maxOutputVals = -1;
> maxNonConstNdv? add comment describing what this is
Done


Line 415:           if (!constLiteralSet.contains(thenLiteral)) {
> you can add() and check the return value to see whether the item was alread
Done


Line 424: 
> remove bank line
Done


Line 425:         if (thenNumDistinct == -1) {
> single line
Done


Line 432:     if (hasElseExpr_) {
> good bit of common code between the handing of this else expr and the then 
Combined the then/else part by changing the iteration


Line 460:         // All must be constant, because if we hit any slot, this would be set
> slot -> SlotRef
Done


http://gerrit.cloudera.org:8080/#/c/5768/2/fe/src/main/java/org/apache/impala/analysis/Expr.java
File fe/src/main/java/org/apache/impala/analysis/Expr.java:

Line 304:       // get the max number of distinct values over all the children of this node
> sufficient to write:
Done


Line 308:         // only the SlotRefs, except that it allows a Expr to override the values
> nit: an Expr
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <joemcdonnell@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonnell@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message