hive-dev mailing list archives

From "Suhas Satish (JIRA)" <>
Subject [jira] [Commented] (HIVE-8621) Dump small table join data into appropriate number of broadcast variables [Spark Branch]
Date Thu, 30 Oct 2014 02:45:34 GMT


Suhas Satish commented on HIVE-8621:

So far, the Spark implementation does not tag the small tables, but I realized
that we need to tag them to be able to use different broadcast variables for different tables.

Also, we have 2 reduce sinks (RS) for the 2 small tables in a 3-way map-join. 

In M/R, we have only one HashTableSink operator (HTS) for all small tables combined. This
conversion from RS -> HTS happens in LocalMapJoinProcFactory and is triggered by rule R7
(MapReduceCompiler: MapJoinFactory.getTableScanMapJoin) in the TaskCompiler.optimizeTaskPlan
phase.

Using logic similar to LocalMapJoinProcFactory in SparkMapJoinResolver, we will end up
with 2 HashTableSinks (or in general, (n-1) HTS for an n-way join). Each of these will generate
its own broadcast variable.
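
To illustrate why tagging matters here, below is a minimal sketch of keeping one
hash table per small table, keyed by tag. This is not Hive or Spark code: the class
TaggedSmallTables and its methods are hypothetical names, and a plain in-memory map
stands in for what would be one broadcast variable per HashTableSink.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for "one broadcast variable per small table".
// Nothing here is Hive or Spark API; it only shows lookup by tag.
final class TaggedSmallTables {
    // tag -> (join key -> row); one inner map per small table
    private final Map<Integer, Map<String, String>> tablesByTag = new HashMap<>();

    // Store a row of the small table identified by the given tag.
    void put(int tag, String joinKey, String row) {
        tablesByTag.computeIfAbsent(tag, t -> new HashMap<>()).put(joinKey, row);
    }

    // Probe the small table with the given tag; returns null if there is no match.
    String lookup(int tag, String joinKey) {
        Map<String, String> table = tablesByTag.get(tag);
        return table == null ? null : table.get(joinKey);
    }

    // Number of distinct small tables seen so far.
    int tableCount() {
        return tablesByTag.size();
    }
}
```

In a 3-way map-join, the two small tables would get two different tags, so the same
join key can be probed against each table separately instead of colliding in one
combined table.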

After going through Sandy Ryza's Spark presentation,
it looks like the recommended way to distribute compute in Spark is to have a large number
of SparkTasks. So I think it's better to have the MapWork for each small table as a separate
SparkTask. This can be tackled independently in this JIRA if you guys agree.

> Dump small table join data into appropriate number of broadcast variables [Spark Branch]
> ----------------------------------------------------------------------------------------
>                 Key: HIVE-8621
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Suhas Satish
>            Assignee: Suhas Satish
> The number of broadcast variables that must be created is m x n, where
> 'm' is the number of small tables in the (m+1)-way join and n is the number of buckets
> of tables. If unbucketed, n=1.
> This is a sub-task of map-join for Spark.
> This can use the baseline patch for map-join
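
The m x n count from the description can be sketched as follows. This is only an
illustration of the arithmetic: the class BroadcastPlan, its methods, and the
"tag<i>-bucket<j>" naming are hypothetical, and numbering small-table tags from 1
is an assumption, not something taken from the Hive code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the m x n count; not part of Hive.
final class BroadcastPlan {
    // m small tables, n buckets per table; an unbucketed table means n = 1.
    static int count(int smallTables, int buckets) {
        if (smallTables < 0 || buckets < 1) {
            throw new IllegalArgumentException("need smallTables >= 0 and buckets >= 1");
        }
        return Math.multiplyExact(smallTables, buckets);
    }

    // Enumerate one illustrative identifier per (small table, bucket) pair.
    // Assumption: small-table tags are numbered from 1.
    static List<String> ids(int smallTables, int buckets) {
        List<String> result = new ArrayList<>(count(smallTables, buckets));
        for (int tag = 1; tag <= smallTables; tag++) {
            for (int bucket = 0; bucket < buckets; bucket++) {
                result.add("tag" + tag + "-bucket" + bucket);
            }
        }
        return result;
    }
}
```

For a 3-way join (m = 2) on unbucketed tables, count(2, 1) gives 2 broadcast
variables; with 4 buckets per table, count(2, 4) gives 8.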

This message was sent by Atlassian JIRA
