hive-dev mailing list archives

From "Suhas Satish (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-8621) Dump small table join data into appropriate number of broadcast variables [Spark Branch]
Date Thu, 30 Oct 2014 02:45:34 GMT

    [ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189531#comment-14189531 ]

Suhas Satish commented on HIVE-8621:
------------------------------------

So far in the Spark implementation, we are not tagging the small tables, but I realized
that we need to tag them so that different tables can use different broadcast variables.


Also, we have 2 ReduceSink operators (RS), one for each of the 2 small tables in a 3-way map-join.

In M/R, we have only one HashTableSink operator (HTS) for all small tables combined. This
conversion from RS -> HTS happens in LocalMapJoinProcFactory and is triggered by rule R7
(MapReduceCompiler: MapJoinFactory.getTableScanMapJoin) in the TaskCompiler.optimizeTaskPlan
phase.

If SparkMapJoinResolver uses logic similar to that in LocalMapJoinProcFactory, we will end up
with 2 HashTableSinks (or in general, (n-1) HTS for an n-way join). Each of these will generate
its own broadcast variable.
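
As a rough sketch of the tagging idea (not the actual Hive code), the snippet below assigns
each small table in an n-way join its own tag, so that each tag can later key a separate
broadcast variable. The class name SmallTableTags, the method tagSmallTables, and the use of
the table's position in the join as its tag are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SmallTableTags {

    // Hypothetical helper: maps a tag (here, the table's position in the join)
    // to the small table's name. The big table is skipped, so an n-way join
    // yields n-1 entries, one per HashTableSink / broadcast variable.
    static Map<Integer, String> tagSmallTables(List<String> joinTables, int bigTablePos) {
        if (bigTablePos < 0 || bigTablePos >= joinTables.size()) {
            throw new IllegalArgumentException("bigTablePos out of range: " + bigTablePos);
        }
        Map<Integer, String> tagged = new LinkedHashMap<>();
        for (int pos = 0; pos < joinTables.size(); pos++) {
            if (pos != bigTablePos) {
                tagged.put(pos, joinTables.get(pos));
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        // 3-way map-join: one big table and two small tables (names are made up).
        Map<Integer, String> tagged =
            tagSmallTables(Arrays.asList("big", "small_a", "small_b"), 0);
        for (Map.Entry<Integer, String> e : tagged.entrySet()) {
            System.out.println("tag " + e.getKey() + " -> broadcast variable for " + e.getValue());
        }
    }
}
```

For the 3-way join in the example, this yields two tagged small tables, matching the (n-1)
HTS count above.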

After going through Sandy Ryza's Spark presentation here,
http://www.slideshare.net/SandyRyza/spark-job-failures-talk
it looks like the recommended way to distribute compute in Spark is to have a large number
of SparkTasks. So I think it's better to run the MapWork for each small table as a separate
SparkTask. This can be tackled independently in this jira, if you guys agree:
https://issues.apache.org/jira/browse/HIVE-8622


> Dump small table join data into appropriate number of broadcast variables [Spark Branch]
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-8621
>                 URL: https://issues.apache.org/jira/browse/HIVE-8621
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Suhas Satish
>            Assignee: Suhas Satish
>
> The number of broadcast variables that must be created is m x n, where
> 'm' is the number of small tables in the (m+1)-way join and 'n' is the number of buckets
> of tables. If unbucketed, n=1.
> This is a sub-task of map-join for spark 
> https://issues.apache.org/jira/browse/HIVE-7613
> This can use the baseline patch for map-join
> https://issues.apache.org/jira/browse/HIVE-8616
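
The m x n count in the issue description can be sketched as below. This is only an
illustration of the arithmetic; the class and method names are hypothetical, and treating a
bucket count of 0 or less as "unbucketed" is an assumption, not something the issue states.

```java
public class BroadcastCount {

    // Hypothetical helper: number of broadcast variables for an (m+1)-way
    // map-join with m small tables and n buckets per table.
    // Assumption: buckets <= 0 means the tables are unbucketed, so n = 1.
    static int broadcastVariableCount(int smallTables, int buckets) {
        if (smallTables < 0) {
            throw new IllegalArgumentException("smallTables must be >= 0");
        }
        int n = Math.max(buckets, 1);
        return smallTables * n;
    }

    public static void main(String[] args) {
        // 3-way join (m = 2), unbucketed tables.
        System.out.println(broadcastVariableCount(2, 0));
        // 3-way join (m = 2), 4 buckets.
        System.out.println(broadcastVariableCount(2, 4));
    }
}
```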



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
