hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Xuefu Zhang" <xzh...@cloudera.com>
Subject Re: Review Request 28930: HIVE-8639 : Convert SMBJoin to MapJoin [Spark Branch]
Date Thu, 18 Dec 2014 19:55:40 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28930/#review65528
-----------------------------------------------------------

Ship it!


Ship It!

- Xuefu Zhang


On Dec. 18, 2014, 2:07 a.m., Szehon Ho wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28930/
> -----------------------------------------------------------
> 
> (Updated Dec. 18, 2014, 2:07 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8639
>     https://issues.apache.org/jira/browse/HIVE-8639
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> In MapReduce for auto-SMB joins, SortedMergeJoinProc is run in the earlier Optimizer
layer to convert join to SMB join, and SortMergeJoinResolver is run in later PhysicalOptimizer
layer to convert it to MapJoin.  For Spark, we have an opportunity to make it cleaner by deciding
putting both SMB and MapJoin conversions in the logical layer and deciding which one to call.
> 
> This patch introduces a new unitied join processor called 'SparkJoinOptimizer' in the
logical layer.  This will call 'SparkMapJoinOptimizer' and 'SparkSortMergeJoinOptimizer' in
a certain order depending on the flags that are set and which ever one is available fails.
 Thus no need to write a SMB -> MapJoin path.
> 
> 'SparkSortMergeJoinOptimizer' is a new class that wraps the logic of SortedMergeJoinProc
but for Spark.  To put both MapJoin/SMB processor in the same level, I had to do some fixes.
 
> 
> 1.  One fix is in 'NonBlockingOpDeDupProc', to fix the join context state, as now its
run before the SMB code that relies on it.  For this I submitted a trunk patch at HIVE-9060.
> 2.  The second fix is that MapReduce's SMB code did two graph walks, one processor to
calculate all 'rejected' joins, and another processor to change the non-rejected ones to SMB
join.  That would have made it so we do multiple walks, so I refactored the 'rejected' join
logic in the same join-operator visit in SparkSortMergeJoinOptimizer.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java c2e643d 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkJoinOptimizer.java PRE-CREATION

>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java 680c6fd

>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkReduceSinkMapJoinProc.java
83625ef 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSortMergeJoinOptimizer.java
PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java 5e432ac 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java b6a7ac2 
>   ql/src/test/results/clientpositive/spark/auto_join32.q.out 28c022e 
>   ql/src/test/results/clientpositive/spark/auto_join_stats.q.out bccd246 
>   ql/src/test/results/clientpositive/spark/auto_smb_mapjoin_14.q.out 842b4b3 
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_1.q.out 2e35c66 
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_12.q.out ee37010 
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_13.q.out b2e928f 
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_14.q.out 20ee657 
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_15.q.out 0a48d00 
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_2.q.out 5008a3f 
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_3.q.out 3b081af 
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_4.q.out 2a11fb2 
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_5.q.out 0d971d2 
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_6.q.out 9d455dc 
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_7.q.out 61eb6ae 
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_8.q.out 198d50d 
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_9.q.out f59e57f 
>   ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_2.q.out b58091c

>   ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_4.q.out 8ee392e

>   ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_6.q.out 9c119df

>   ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_7.q.out b9ad92d

>   ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_8.q.out ed4d03f

>   ql/src/test/results/clientpositive/spark/cross_product_check_2.q.out 6fb69a5 
>   ql/src/test/results/clientpositive/spark/parquet_join.q.out 240989a 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_17.q.out 268ae23 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_25.q.out df66cc2 
>   ql/src/test/results/clientpositive/spark/subquery_multiinsert.q.out f635949 
> 
> Diff: https://reviews.apache.org/r/28930/diff/
> 
> 
> Testing
> -------
> 
> Most of the auto-smb tests give the same output with this change, the only difference
is now some SMB joins become MapJoin if "hive.auto.convert.sortmerge.join.to.mapjoin" is on,
as expected.
> 
> One failing test is auto_sortmerge_join_9.  This was passing until yesterday when bucket-map
join is enabled in HIVE-8638.  As expected, by choosing MapJoins over SMB join if "hive.auto.convert.sortmerge.join.to.mapjoin"
is on, the MapJoin may become a bucket-mapjoin.  Some of the more complicated queries of auto_sortmerge_join_9
get converted to bucket mapjoin and fail.  Can probably file a new JIRA to fix this test.
> 
> 
> Thanks,
> 
> Szehon Ho
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message