hive-dev mailing list archives

From "Szehon Ho (JIRA)" <>
Subject [jira] [Commented] (HIVE-8621) Aggregate all small table join data into 1 broadcast variable
Date Mon, 27 Oct 2014 22:39:34 GMT


Szehon Ho commented on HIVE-8621:

Hi Suhas, thanks for creating the JIRA.  I think that we should actually have m x n variables
(m = numSmallTables, n = numBuckets).  If you read the code of MapJoinOperator, its processing
logic, as I understand it, keeps them as separate data structures.  It would be better if we
can reuse that operator.

We can still do everything else as planned (m MapTasks that are union'ed), but during the
collection phase it should be easy for us to make one variable per table (by checking the
alias tag).  This is the same logic that MapReduce uses to divide the data per table (see
HashTableSinkOperator.flushToFile()).  Let me know if that makes sense.
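
To illustrate the per-table split by alias tag, here is a minimal sketch in Python rather than
Hive's Java.  It is not Hive code: the function name group_by_alias_tag, the integer tags, and
the (tag, key, value) row shape are all assumptions made for the example.  It only shows the
idea of turning one union'ed collection into one hash table per small table.

```python
from collections import defaultdict


def group_by_alias_tag(collected_rows):
    """Split union'ed (tag, key, value) rows into one hash table per small table.

    Hypothetical helper: the tag stands in for the alias tag of the small table
    that produced the row; the row layout is an assumption for illustration.
    """
    tables = defaultdict(lambda: defaultdict(list))
    for tag, key, value in collected_rows:
        # Rows with the same tag belong to the same small table.
        tables[tag][key].append(value)
    # Plain dicts, so each per-table structure can be handled on its own
    # (for example, wrapped in its own broadcast variable).
    return {tag: dict(rows) for tag, rows in tables.items()}


# Rows as they might look after collecting the union of two small tables.
rows = [
    (0, "k1", "a"),
    (1, "k1", "x"),
    (0, "k2", "b"),
    (1, "k1", "y"),
]
per_table = group_by_alias_tag(rows)
```

Each entry of per_table would then become its own variable, so the join side can keep one
structure per small table instead of one merged structure.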

The trickier part is bucket join: how to get one variable per bucket after the results are
collected.  More research is needed there.

> Aggregate all small table join data into 1 broadcast variable
> -------------------------------------------------------------
>                 Key: HIVE-8621
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Suhas Satish
>            Assignee: Suhas Satish
> This is a sub-task of map-join for Spark.
> This can use the baseline patch for map-join.

This message was sent by Atlassian JIRA