impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5612: join inversion should factor in parallelism
Date Sat, 08 Jul 2017 00:37:47 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5612: join inversion should factor in parallelism
......................................................................


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7351/2/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

Line 386:    *    cardinality*avgSerializedSize. Do not invert if relevant stats are missing.
> Update comment to add the 4th case?
Updated #3 to include a pointer to isInvertedJoinCheaper(). There were some other errors in
the comment too.


PS2, Line 424: invertedJoinIsCheaper
> nit:isInvertedJoinCheaper()?
Done


PS2, Line 459: rhsCard
Should have been lhs/rhsBytes.


PS2, Line 459: (log_b(rhsBytes) + C) * (lhsCard + 2 * rhsCard)
> What is the unit of this? In other words what exactly are we trying to opti
It's really execution time, with an arbitrary unit.

There were a couple of inaccuracies in the comments w.r.t. bytes versus rows.

We definitely need to multiply 2 * rhsCard by the log since we need to access the hash table
while doing the build.

Multiplying the whole thing by C doesn't do anything since it won't change the result of the
final comparison. Adding C in that place makes sense.

E.g. consider 10 rows on the build side versus 100 rows. Without the + C, the model say that
there's a log(100) / log(10) = 2x perf difference per probe row, which doesn't seem realistic.
with C = 5 you get a 1.167x difference, which seems more realistic. Using log2() tended to
have the same problem - it seemed overly pessimistic about how much slower things would get.

The good thing is that the model mostly isn't sensitive to changes in parameters for larger
input sizes.


Line 488:     final long CONSTANT_COST_PER_ROW = 5;
> How was this chosen?
I basically played around with parameters and picked values that seemed to produce reasonable
results. 

Updated the class comment to mention this.


PS2, Line 491: log10
> Shouldn't this be base 2? Don't think it matters as long as we use same for
This was also empirically chosen.


-- 
To view, visit http://gerrit.cloudera.org:8080/7351
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icacea4565ce25ef15aaab014684c9440dd501d4e
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bharathv@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message