impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-5612: join inversion should factor in parallelism
Date Sat, 08 Jul 2017 00:37:47 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5612: join inversion should factor in parallelism

Patch Set 2:

File fe/src/main/java/org/apache/impala/planner/

Line 386:    *    cardinality*avgSerializedSize. Do not invert if relevant stats are missing.
> Update comment to add the 4th case?
Updated #3 to include a pointer to isInvertedJoinCheaper(). There were some other errors in
the comment too.

PS2, Line 424: invertedJoinIsCheaper
> nit:isInvertedJoinCheaper()?

PS2, Line 459: rhsCard
Should have been lhs/rhsBytes.

PS2, Line 459: (log_b(rhsBytes) + C) * (lhsCard + 2 * rhsCard)
> What is the unit of this? In other words what exactly are we trying to opti
It's really execution time, with an arbitrary unit.

There were a couple of inaccuracies in the comments w.r.t. bytes versus rows.

We definitely need to multiply 2 * rhsCard by the log since we need to access the hash table
while doing the build.

Multiplying the whole thing by C doesn't do anything since it won't change the result of the
final comparison. Adding C in that place makes sense.

E.g. consider 10 rows on the build side versus 100 rows. Without the + C, the model say that
there's a log(100) / log(10) = 2x perf difference per probe row, which doesn't seem realistic.
with C = 5 you get a 1.167x difference, which seems more realistic. Using log2() tended to
have the same problem - it seemed overly pessimistic about how much slower things would get.

The good thing is that the model mostly isn't sensitive to changes in parameters for larger
input sizes.

Line 488:     final long CONSTANT_COST_PER_ROW = 5;
> How was this chosen?
I basically played around with parameters and picked values that seemed to produce reasonable

Updated the class comment to mention this.

PS2, Line 491: log10
> Shouldn't this be base 2? Don't think it matters as long as we use same for
This was also empirically chosen.

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: Icacea4565ce25ef15aaab014684c9440dd501d4e
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Bharath Vissapragada <>
Gerrit-Reviewer: Mostafa Mokhtar <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-HasComments: Yes

View raw message