impala-reviews mailing list archives

From "Alex Behm (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5583: [DOCS] Document default join distribution mode query option
Date Mon, 26 Jun 2017 23:10:03 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-5583: [DOCS] Document default_join_distribution_mode query option
......................................................................


Patch Set 1:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7300/1/docs/impala.ditamap
File docs/impala.ditamap:

Line 179:           <topicref rev="2.9.0 IMPALA-5381 IMPALA-5583" href="topics/impala_default_join_distribution_mode.xml"/>
Why mention IMPALA-5583 also?


http://gerrit.cloudera.org:8080/#/c/7300/1/docs/topics/impala_default_join_distribution_mode.xml
File docs/topics/impala_default_join_distribution_mode.xml:

Line 40:       This option determines the join strategy that Impala uses when any of the tables
We deliberately did not use "join strategy" in the option name because strategy is too generic.


Line 47:       Hive <codeph>ANALYZE TABLE</codeph> statement.
Sure you want to keep the ANALYZE TABLE part? In most situations we cannot effectively use
what Hive produces.


Line 48:       By default, when a table involved in the join query does not have statistics,
Accuracy could be improved. What if both tables do not have stats? Clarify that one table
is going to be broadcast. It might even be worth explicitly listing what happens if one table
has stats and the other doesn't (the one without stats will be broadcast).
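
To make the cases in the comment above concrete, here is a minimal sketch of the fallback rule
the documentation would need to spell out. It is an illustration only, not Impala's planner
code: the function name and its boolean inputs are hypothetical, and it assumes the option
accepts the values BROADCAST (taken here as the default) and SHUFFLE. When both tables have
statistics the choice is cost-based, which this sketch does not model.

```python
from typing import Optional


def fallback_join_mode(left_has_stats: bool,
                       right_has_stats: bool,
                       default_mode: str = "BROADCAST") -> Optional[str]:
    """Hypothetical helper: which distribution mode applies when stats are missing.

    Returns None when both tables have stats, because the choice is then
    cost-based and outside the scope of this sketch.
    """
    mode = default_mode.upper()
    if mode not in ("BROADCAST", "SHUFFLE"):
        raise ValueError(f"unknown join distribution mode: {default_mode}")
    if left_has_stats and right_has_stats:
        return None  # cost-based decision, not modeled here
    # At least one table lacks stats, so the query option decides.
    return mode


# The three situations the comment asks the docs to cover.
for left, right in [(False, False), (True, False), (True, True)]:
    print(left, right, fallback_join_mode(left, right))
```

Listing the cases this way also shows what the docs should state in words: the option only
matters when at least one side of the join has no statistics.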


Line 58:       might be missing statistics due to the overhead involved in calculating them,
I wouldn't assume a particular reason for not having stats.


Line 61:       of a table involved in a join query and only transmits a portion of the table
Not very accurate: both tables are transferred across the network. Not sure we need to
explain the differences between broadcast and shuffle here; maybe provide a link to their
explanation/definition?


Line 67:       recommended when setting up and deploying new clusters. This setting is
We should mention why we recommend this. SHUFFLE is generally a safer option because the join
build will be less prone to spilling and/or OOM.
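
A rough back-of-the-envelope sketch of why SHUFFLE is the safer choice for the join build: with
a broadcast join every node holds a full copy of the build-side table, while with a shuffle
(partitioned) join each node holds roughly 1/N of it. The function name and the example sizes
below are hypothetical, and the sketch assumes rows are spread evenly across nodes with no skew.

```python
def build_bytes_per_node(build_table_bytes: int, num_nodes: int, mode: str) -> float:
    """Hypothetical estimate of build-side data each node must hold in memory."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    mode = mode.upper()
    if mode == "BROADCAST":
        # Every node receives the entire build-side table.
        return float(build_table_bytes)
    if mode == "SHUFFLE":
        # The build side is hash-partitioned; assume an even split across nodes.
        return build_table_bytes / num_nodes
    raise ValueError(f"unknown join distribution mode: {mode}")


GIB = 1024 ** 3
# Example: a 10 GiB build-side table on a 10-node cluster (made-up numbers).
for mode in ("BROADCAST", "SHUFFLE"):
    per_node_gib = build_bytes_per_node(10 * GIB, 10, mode) / GIB
    print(f"{mode}: {per_node_gib:.1f} GiB per node")
```

If the planner has no stats and guesses wrong about which table is small, the broadcast case is
the one that grows with the full table size on every node, hence the higher risk of spilling or
running out of memory.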


-- 
To view, visit http://gerrit.cloudera.org:8080/7300
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4ec6213efc46bce0fe07c590841d51c009fb5c84
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell <jrussell@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: John Russell <jrussell@cloudera.com>
Gerrit-HasComments: Yes
