hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Szehon Ho" <>
Subject Review Request 28930: HIVE-8639 : Convert SMBJoin to MapJoin [Spark Branch]
Date Thu, 11 Dec 2014 04:39:35 GMT

This is an automatically generated e-mail. To reply, visit:

Review request for hive.

Bugs: HIVE-8639

Repository: hive-git


In MapReduce for auto-SMB joins, SortedMergeJoinProc is run in the earlier Optimizer layer
to convert join to SMB join, and SortMergeJoinResolver is run in later PhysicalOptimizer layer
to convert it to MapJoin.  For Spark, we have an opportunity to make it cleaner by deciding
putting both SMB and MapJoin conversions in the logical layer and deciding which one to call.

This patch introduces a new unitied join processor called 'SparkJoinOptimizer' in the logical
layer.  This will call 'SparkMapJoinOptimizer' and 'SparkSortMergeJoinOptimizer' in a certain
order depending on the flags that are set and which ever one is available fails.  Thus no
need to write a SMB -> MapJoin path.

'SparkSortMergeJoinOptimizer' is a new class that wraps the logic of SortedMergeJoinProc but
for Spark.  To put both MapJoin/SMB processor in the same level, I had to do some fixes. 

1.  One fix is in 'NonBlockingOpDeDupProc', to fix the join context state, as now its run
before the SMB code that relies on it.  For this I submitted a trunk patch at HIVE-9060.
2.  The second fix is that MapReduce's SMB code did two graph walks, one processor to calculate
all 'rejected' joins, and another processor to change the non-rejected ones to SMB join. 
That would have made it so we do multiple walks, so I refactored the 'rejected' join logic
in the same join-operator visit in SparkSortMergeJoinOptimizer.


  ql/src/java/org/apache/hadoop/hive/ql/optimizer/ 63862b9 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/ a8a3d86 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/ PRE-CREATION

  ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/ 7a716a9

  ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/ PRE-CREATION

  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/ 24e1460 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_1.q.out 2e35c66 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_13.q.out b2e928f 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_14.q.out 20ee657 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_15.q.out 0a48d00 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_2.q.out 5008a3f 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_3.q.out 3b081af 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_4.q.out 2a11fb2 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_5.q.out 0d971d2 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_6.q.out 9d455dc 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_7.q.out 61eb6ae 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_8.q.out 198d50d 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_9.q.out f59e57f 



Most of the auto-smb tests give the same output with this change, the only difference is now
some SMB joins become MapJoin if "" is on, as expected.

One failing test is auto_sortmerge_join_9.  This was passing until yesterday when bucket-map
join is enabled in HIVE-8638.  As expected, by choosing MapJoins over SMB join if ""
is on, the MapJoin may become a bucket-mapjoin.  Some of the more complicated queries of auto_sortmerge_join_9
get converted to bucket mapjoin and fail.  Can probably file a new JIRA to fix this test.


Szehon Ho

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message