hive-dev mailing list archives

From "Xuefu Zhang (JIRA)" <>
Subject [jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]
Date Thu, 18 Sep 2014 05:59:34 GMT


Xuefu Zhang commented on HIVE-7613:

Here is what I have in mind:

1. For an N-way join being converted to a map join, we can run N-1 Spark jobs, one for each
small input to the join (assuming a transformation is needed; if not, we don't need a Spark
job). Each job produces an RDD, so we end up with N-1 RDDs.

2. Dump the contents of the RDDs into the data structure (hash tables) needed by MapJoinOperator.

3. Call SparkContext.broadcast() on that data structure. This broadcasts the data structure
to all nodes.

4. Then we can launch the map-only join job, which can load the broadcast data structure
via the HashTableLoader interface.
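The hash-table side of steps 2 and 4 above can be sketched in plain Java, leaving Spark out; in the real plan the table built in step 2 would be wrapped with SparkContext.broadcast() so each executor receives a read-only copy, and the map-only job would probe it row by row. The class and method names below are illustrative, not Hive's actual MapJoinOperator or HashTableLoader code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal, Spark-free sketch of the map-join data flow (illustrative names).
public class MapJoinSketch {

    // Step 2: dump a small table's rows into a hash table keyed on the join column.
    static Map<String, String> buildHashTable(List<String[]> smallTable) {
        Map<String, String> ht = new HashMap<>();
        for (String[] row : smallTable) {
            ht.put(row[0], row[1]);          // row[0] = join key, row[1] = payload
        }
        return ht;
    }

    // Step 4: map-only join — probe the (broadcast) hash table for each big-table row.
    static List<String> mapJoin(List<String[]> bigTable, Map<String, String> broadcastHt) {
        List<String> out = new ArrayList<>();
        for (String[] row : bigTable) {
            String match = broadcastHt.get(row[0]);
            if (match != null) {             // inner-join semantics: keep matching rows only
                out.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return out;
    }
}
```

Because the hash table is broadcast once rather than shuffled, the big table never moves across the network, which is the whole point of converting a common join to a map join.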

For more information about Spark's broadcast variable, please refer to

> Research optimization of auto convert join to map join [Spark branch]
> ---------------------------------------------------------------------
>                 Key: HIVE-7613
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Chengxiang Li
>            Assignee: Suhas Satish
>            Priority: Minor
>         Attachments: HIve on Spark Map join background.docx
> ConvertJoinMapJoin is an optimization that replaces a common join (aka shuffle join) with
> a map join (aka broadcast or fragment replicate join) when possible. We need to research how
> to make it workable with Hive on Spark.

This message was sent by Atlassian JIRA
