hive-dev mailing list archives

From "Szehon Ho" <sze...@cloudera.com>
Subject Re: Review Request 27265: Support SMB Join for Hive on Spark [Spark Branch]
Date Thu, 30 Oct 2014 19:02:39 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27265/
-----------------------------------------------------------

(Updated Oct. 30, 2014, 7:02 p.m.)


Review request for hive.


Changes
-------

Rebase with mapjoin operator changes, and address review comments.


Repository: hive-git


Description
-------

This change reuses the SMBJoinOperator for Spark.  Background: the logical layer already
converts joins to SMB joins.  This change just introduces a class called "SparkSortMergeJoinFactory"
on the Spark compile path, which attaches the data structures (such as local work and bucket info)
to the MapWork for the SMBJoinOperator to consume.  It is largely based on the MapReduce class
"MapJoinFactory".

However, on the Spark path it is activated only for SMB joins and not for map joins, because we
have another strategy for map joins.  That is why there is a new optimizer rule called "TypeRule":
it ensures this processor runs only on SMBJoinOperators, which share the same name as
MapJoinOperators (the shared name is needed by the logical optimizers that deal with hints).
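
To illustrate why a rule keyed on the operator's type can separate two operators that report the
same name, here is a minimal, self-contained Java sketch.  Everything in it (TypeRuleSketch,
Operator, MapJoinOp, SmbJoinOp, matchesByName, matchesByType) is a hypothetical stand-in written
for this example; it is not Hive's operator hierarchy and not the TypeRule class added by this
patch, only an assumption about the general idea.

```java
// Hypothetical stand-ins for an operator hierarchy; these are NOT Hive's classes.
class TypeRuleSketch {
    static class Operator {
        String getName() { return "OP"; }
    }

    // Both join operators report the same name, as described in the text above.
    static class MapJoinOp extends Operator {
        @Override String getName() { return "MAPJOIN"; }
    }

    static class SmbJoinOp extends Operator {
        @Override String getName() { return "MAPJOIN"; }
    }

    // A name-based check cannot tell the two operators apart.
    static boolean matchesByName(Operator op, String name) {
        return op.getName().equals(name);
    }

    // A type-based check matches only instances of the requested class.
    static boolean matchesByType(Operator op, Class<? extends Operator> type) {
        return type.isInstance(op);
    }

    public static void main(String[] args) {
        Operator mapJoin = new MapJoinOp();
        Operator smbJoin = new SmbJoinOp();
        System.out.println("by name, map join: " + matchesByName(mapJoin, "MAPJOIN"));
        System.out.println("by name, SMB join: " + matchesByName(smbJoin, "MAPJOIN"));
        System.out.println("by type, map join: " + matchesByType(mapJoin, SmbJoinOp.class));
        System.out.println("by type, SMB join: " + matchesByType(smbJoin, SmbJoinOp.class));
    }
}
```

In this sketch the name-based check accepts both operators, while the type-based check accepts
only the SMB join, which is the selectivity the processor needs.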

One major assumption behind the whole SMB concept is that both tables have corresponding buckets.
While testing with large numbers of buckets (as in auto_sortmerge_join_16), I found that an "insert"
into a bucketed table wasn't putting the same keys in corresponding buckets.  I activated
MR-style shuffle (hash shuffle instead of total-order shuffle), and that seemed to solve the
issue.
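
One plausible reading of the bucket mismatch, sketched below in plain Java, is that a hash-style
assignment depends only on the key and the bucket count, whereas a range-style (total-order)
assignment derives its boundaries from the data being written, so two tables with different
contents can place the same key in different buckets.  The class and method names
(BucketAssignmentSketch, hashBucket, rangeBucket) and the boundary-picking scheme are assumptions
made for illustration; this is not the code path changed by the patch, and the root cause has not
been confirmed beyond the observation above.

```java
import java.util.Arrays;

// Illustrative only: hypothetical helpers, not Hive or Spark code.
class BucketAssignmentSketch {
    // Hash-style assignment: depends only on the key and the bucket count.
    static int hashBucket(int key, int numBuckets) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numBuckets;
    }

    // Range-style assignment: boundaries are taken from the table's own keys,
    // so the result depends on what else is in the table.
    static int rangeBucket(int key, int[] tableKeys, int numBuckets) {
        int[] sorted = tableKeys.clone();
        Arrays.sort(sorted);
        int bucket = 0;
        for (int b = 1; b < numBuckets; b++) {
            int boundary = sorted[b * sorted.length / numBuckets];
            if (key >= boundary) {
                bucket = b;
            }
        }
        return bucket;
    }

    public static void main(String[] args) {
        int[] tableA = {1, 2, 3, 4};
        int[] tableB = {3, 4, 5, 6, 7, 8};
        int key = 3;
        System.out.println("hash bucket for key 3:    " + hashBucket(key, 2));
        System.out.println("range bucket in table A:  " + rangeBucket(key, tableA, 2));
        System.out.println("range bucket in table B:  " + rangeBucket(key, tableB, 2));
    }
}
```

With these sample inputs the hash bucket for key 3 is the same no matter which table it is written
to, while the range bucket differs between the two tables, which would break the
corresponding-buckets assumption.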


Diffs (updated)
-----

  itests/src/test/resources/testconfiguration.properties c429799 
  ql/src/java/org/apache/hadoop/hive/ql/lib/TypeRule.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java da764cf 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSortMergeJoinFactory.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java d33d877 
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java 3d08d49 
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java b94db6b 
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 2d7a134 
  ql/src/test/results/clientpositive/spark/auto_join32.q.out 8d83188 
  ql/src/test/results/clientpositive/spark/auto_smb_mapjoin_14.q.out e64d4fb 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_1.q.out 9158d65 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_13.q.out a5a281b 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_14.q.out 2fc3bb6 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_15.q.out 74cbd7c 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_16.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_2.q.out d1bb7a0 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_3.q.out d57a1d7 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_4.q.out 8244c50 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_5.q.out 2ab1bca 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_6.q.out bc4a163 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_7.q.out 16ef3ae 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_8.q.out 9fd3e5a 
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_9.q.out a7f994f 
  ql/src/test/results/clientpositive/spark/bucket2.q.out b1b2997 
  ql/src/test/results/clientpositive/spark/bucket3.q.out 019c11a 
  ql/src/test/results/clientpositive/spark/bucket4.q.out 2cbab11 
  ql/src/test/results/clientpositive/spark/disable_merge_for_bucketing.q.out 590b265 
  ql/src/test/results/clientpositive/spark/load_dyn_part2.q.out f8f8971 
  ql/src/test/results/clientpositive/spark/parquet_join.q.out d5a8684 
  ql/src/test/results/clientpositive/spark/script_pipe.q.out 5b966ff 
  ql/src/test/results/clientpositive/spark/skewjoin.q.out d674d04 
  ql/src/test/results/clientpositive/spark/skewjoin_noskew.q.out d45cdd3 
  ql/src/test/results/clientpositive/spark/smb_mapjoin_17.q.out 482268c 
  ql/src/test/results/clientpositive/spark/smb_mapjoin_25.q.out efa38d4 
  ql/src/test/results/clientpositive/spark/tez_join_tests.q.out 9254944 

Diff: https://reviews.apache.org/r/27265/diff/


Testing
-------

Ran existing auto_sortmerge_* tests.


Thanks,

Szehon Ho

