hive-issues mailing list archives

From "liyunzhang_intel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-17018) Small table is converted to map join even the total size of small tables exceeds the threshold(hive.auto.convert.join.noconditionaltask.size)
Date Tue, 11 Jul 2017 21:53:01 GMT

    [ https://issues.apache.org/jira/browse/HIVE-17018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083058#comment-16083058
] 

liyunzhang_intel commented on HIVE-17018:
-----------------------------------------

[~csun]:  
{quote}A better way might be to have a separate config just for HoS, and maybe a limit on
small table memory per executor.{quote}
What confuses me is how to do this. The original code checks whether the total map join
size within the same stage exceeds the threshold. Should we now add a new configuration that
derives the threshold for all small tables from spark.executor.memory? That is, if the total size
of the small tables in the same stage is bigger than spark.executor.memory, those
small tables are not allowed into the same stage, while {{hive.auto.convert.join.noconditionaltask.size}} is
used for calculating the total size of the map join small tables in the query?

> Small table is converted to map join even the total size of small tables exceeds the
threshold(hive.auto.convert.join.noconditionaltask.size)
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-17018
>                 URL: https://issues.apache.org/jira/browse/HIVE-17018
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-17018_data_init.q, HIVE-17018.q, t3.txt
>
>
>  We use "hive.auto.convert.join.noconditionaltask.size" as the threshold. It means that if the
sum of the sizes of n-1 of the tables/partitions in an n-way join is smaller than it, the join will
be converted to a map join. For example, take A join B join C join D join E. The big table is A(100M) and the
small tables are B(10M), C(10M), D(10M), E(10M). Suppose we set hive.auto.convert.join.noconditionaltask.size=20M.
In the current code, E, D and B will be converted to map joins but C will not.
In my understanding, because hive.auto.convert.join.noconditionaltask.size can only hold
E and D, C and B should not be converted to map joins.
> Let me explain in more detail why B can be converted to a map join.
> In the current code, [SparkMapJoinOptimizer#getConnectedMapJoinSize|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java#L364]
sums the sizes of all the map joins in the parent path and the child path. The search stops when it encounters a
[UnionOperator or ReduceOperator|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java#L381].
C is not converted to a map join because {{(connectedMapJoinSize + totalSize) > maxSize}}
[see code|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java#L330], so the
RS before the join of C remains. When deciding whether B will be converted to a map join,
{{getConnectedMapJoinSize}} returns 0 because it encounters that [RS |https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java#409],
and so {{(connectedMapJoinSize + totalSize) < maxSize}} holds.
> [~xuefuz] or [~jxiang]: could you help check whether this is a bug, as you are more
familiar with SparkJoinOptimizer?
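
The behaviour described above can be reproduced with a small simulation. This is only a sketch
under stated assumptions, not the code in SparkMapJoinOptimizer: the joins are modelled as a
linear chain, they are assumed to be visited in the order E, D, C, B, and a join that was not
converted is treated as the RS boundary where the connected-size walk stops. The function names
and the chain model are hypothetical and exist only for illustration.

```python
MB = 1024 * 1024


def connected_map_join_size(index, sizes, converted):
    """Sum the small-table sizes of already-converted map joins that are
    contiguous with the join at `index`, walking in both directions.
    The walk stops at the first join that is not a map join, which stands
    in for hitting an RS (assumption of this simplified chain model)."""
    total = 0
    for step in (-1, 1):
        i = index + step
        while 0 <= i < len(sizes) and converted[i]:
            total += sizes[i]
            i += step
    return total


def simulate(names, sizes, max_size, order):
    """Visit joins in `order` and convert one to a map join unless
    connected size + its own small-table size exceeds max_size."""
    converted = [False] * len(sizes)
    for idx in order:
        connected = connected_map_join_size(idx, sizes, converted)
        if connected + sizes[idx] > max_size:
            continue  # stays a common join, so its RS remains
        converted[idx] = True
    return [name for name, done in zip(names, converted) if done]


# Small tables of "A join B join C join D join E", 10M each, threshold 20M.
names = ["B", "C", "D", "E"]
sizes = [10 * MB] * 4
result = simulate(names, sizes, 20 * MB, order=[3, 2, 1, 0])
total_converted = sum(s for n, s in zip(names, sizes) if n in result)
print(result, total_converted // MB)
```

In this model B, D and E end up converted, so the small tables held in memory total 30M even
though the threshold is 20M. The unconverted join of C resets the running total to 0 for B.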



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
