hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Carter Shanklin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-17018) Small table is converted to map join even the total size of small tables exceeds the threshold(hive.auto.convert.join.noconditionaltask.size)
Date Sat, 15 Jul 2017 20:43:02 GMT

    [ https://issues.apache.org/jira/browse/HIVE-17018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088721#comment-16088721
] 

Carter Shanklin commented on HIVE-17018:
----------------------------------------

My inputs:

That particular variable can't be renamed to something spark specific since all engines use
it
Adding a net new variable for Spark would increase confusion rather than decrease it.

It would be good to have some sort of descriptive name that applies to both Tez and MR. As
pointed out there is no relation between what that variable used to do and what it does today,
and the implication of changing that parameter is difficult to guess.

Maybe a new variable like hive.auto.convert.join.max.hashtable.size could be introduced. Both
engines switch to that variable at some point, then usage of the old variable could be deprecated
and then removed.

Just my inputs. /cc [~ashutoshc]

> Small table is converted to map join even the total size of small tables exceeds the
threshold(hive.auto.convert.join.noconditionaltask.size)
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-17018
>                 URL: https://issues.apache.org/jira/browse/HIVE-17018
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-17018_data_init.q, HIVE-17018.q, t3.txt
>
>
>  we use "hive.auto.convert.join.noconditionaltask.size" as the threshold. it means  the
sum of size for n-1 of the tables/partitions for a n-way join is smaller than it, it will
be converted to a map join. for example, A join B join C join D join E. Big table is A(100M),
small tables are B(10M),C(10M),D(10M),E(10M).  If we set hive.auto.convert.join.noconditionaltask.size=20M.
In current code, E,D,B will be converted to map join but C will not be converted to map join.
In my understanding, because hive.auto.convert.join.noconditionaltask.size can only contain
E and D, so C and B should not be converted to map join.  
> Let's explain more why E can be converted to map join.
> in current code, [SparkMapJoinOptimizer#getConnectedMapJoinSize|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java#L364]
calculates all the mapjoins  in the parent path and child path. The search stops when encountering
[UnionOperator or ReduceOperator|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java#L381].
Because C is not converted to map join because {{connectedMapJoinSize + totalSize) > maxSize}}
[see code|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java#L330].The
RS before the join of C remains. When calculating whether B will be converted to map join,
{{getConnectedMapJoinSize}} returns 0 as encountering [RS |https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java#409]
and causes  {{connectedMapJoinSize + totalSize) < maxSize}} matches.
> [~xuefuz] or [~jxiang]: can you help see whether this is a bug or not  as you are more
familiar with SparkJoinOptimizer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message