hive-dev mailing list archives

From "Chao Sun" <chao....@cloudera.com>
Subject Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
Date Sun, 09 Nov 2014 05:56:57 GMT


> On Nov. 8, 2014, 3:15 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 214
> > <https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line214>
> >
> >     This assumes that the resulting SparkWorks will be linearly dependent on each other,
> >     which isn't true in general. Let's say there are two works (w1 and w2), each having
> >     a map join operator. w1 and w2 are connected to w3 via HTS. w3 also contains a map
> >     join operator. The dependency in this scenario will be a graph rather than linear.
> 
> Chao Sun wrote:
>     I was thinking, in this case, if there's no dependency between w1 and w2, they can
>     be put in the same SparkWork, right?
>     Otherwise, they will form a linear dependency too.
> 
> Xuefu Zhang wrote:
>     w1 and w2 are fine. They will be in the same SparkWork. This SparkWork will depend
>     on both the SparkWork generated for w1 and the SparkWork generated for w2. That
>     dependency is not linear.
>     
>     In more detail: for each work that has a map join op, we need to create a SparkWork
>     to handle its small tables. So both w1 and w2 will need such a SparkWork. While w1
>     and w2 are themselves in the same SparkWork, that SparkWork depends on the two
>     SparkWorks created.

I'm not getting it. Why is this dependency not linear? Can you give a counter-example?
Suppose w1 (MJ_1), w2 (MJ_2), and w3 (MJ_3) are like the following:

     HTS_1   HTS_2     HTS_3    HTS_4
       \      /           \     /
        \    /             \   /
          MJ_1              MJ_2
           |                 |
           |                 |
          HTS_5            HTS_6
              \            /
               \          /
                \        /
                 \      /
                  \    /
                    MJ_3
                    
Then, what I'm doing is to put HTS_1, HTS_2, HTS_3, and HTS_4 in the same SparkWork, say SW_1;
then MJ_1, MJ_2, HTS_5, and HTS_6 will be in another SparkWork, SW_2; and MJ_3 in a third
SparkWork, SW_3. The dependency is linear:
SW_1 -> SW_2 -> SW_3.
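
To make this concrete, here is a minimal standalone sketch (not the actual SparkMapJoinResolver
code; the work graph and class names below are hypothetical stand-ins for Hive's BaseWork/SparkWork,
just to illustrate the grouping). It buckets each work by its HTS-depth from the leaves and turns
each bucket into one SparkWork, which gives exactly the linear chain SW_1 -> SW_2 -> SW_3 for the
diagram above (HTS_5 and HTS_6 are folded into the works containing MJ_1 and MJ_2):

    import java.util.*;

    // Hypothetical sketch: bucket works by HTS-depth, one SparkWork per bucket.
    class LinearSplitSketch {
        // maps each work to the works feeding it through HTS edges (stand-in for the real work graph)
        static Map<String, List<String>> parentsOf = new HashMap<>();

        // depth of a work = 1 + max depth of its HTS parents (works with no parents have depth 0)
        static int depth(String work) {
            int d = 0;
            for (String p : parentsOf.getOrDefault(work, Collections.<String>emptyList())) {
                d = Math.max(d, depth(p) + 1);
            }
            return d;
        }

        public static void main(String[] args) {
            // the example from the diagram above
            parentsOf.put("MJ_1", Arrays.asList("HTS_1", "HTS_2"));
            parentsOf.put("MJ_2", Arrays.asList("HTS_3", "HTS_4"));
            parentsOf.put("MJ_3", Arrays.asList("MJ_1", "MJ_2"));  // connected via HTS_5 / HTS_6

            // bucket works by depth; each bucket becomes one SparkWork
            SortedMap<Integer, List<String>> buckets = new TreeMap<>();
            for (String w : Arrays.asList("HTS_1", "HTS_2", "HTS_3", "HTS_4", "MJ_1", "MJ_2", "MJ_3")) {
                buckets.computeIfAbsent(depth(w), k -> new ArrayList<>()).add(w);
            }

            // the buckets form a linear chain: SW_1 -> SW_2 -> SW_3
            int i = 1;
            for (List<String> bucket : buckets.values()) {
                System.out.println("SW_" + i++ + ": " + bucket);
            }
        }
    }

Running this prints SW_1: [HTS_1, HTS_2, HTS_3, HTS_4], SW_2: [MJ_1, MJ_2], SW_3: [MJ_3],
which matches the grouping described above.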


- Chao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27627/#review60482
-----------------------------------------------------------


On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27627/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2014, 6:07 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8622
>     https://issues.apache.org/jira/browse/HIVE-8622
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This is a sub-task of map-join for Spark:
> https://issues.apache.org/jira/browse/HIVE-7613
> It can use the baseline patch for map-join:
> https://issues.apache.org/jira/browse/HIVE-8616
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 
> 
> Diff: https://reviews.apache.org/r/27627/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chao Sun
> 
>

