hive-dev mailing list archives

From "Suhas Satish (JIRA)" <>
Subject [jira] [Updated] (HIVE-8621) Dump small table join data for map-join [Spark Branch]
Date Tue, 04 Nov 2014 15:56:38 GMT


Suhas Satish updated HIVE-8621:
    Assignee:     (was: Suhas Satish)

> Dump small table join data for map-join [Spark Branch]
> ------------------------------------------------------
>                 Key: HIVE-8621
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Suhas Satish
> This JIRA aims to reuse a slightly modified form of the MapReduce distributed-cache
> approach in Spark: dump the map-joined small tables as hash tables onto the Spark
> cluster's DFS.
> This is a sub-task of map-join for Spark and can build on the baseline map-join patch.
> The original thought process was to use Spark's broadcast-variable concept for the
> small tables.
> The number of broadcast variables that must be created is m x n, where
> 'm' is the number of small tables in the (m+1)-way join and 'n' is the number of
> buckets per table (n=1 for unbucketed tables).
> But it was discovered that objects compressed with Kryo serialization on disk can
> occupy 20x or more of that space when deserialized in memory. For a bucket join, the
> Spark driver has to hold all the buckets of the bucketed tables in memory (to provide
> fault tolerance against executor failures), although each executor only needs
> individual buckets in its memory. So the broadcast-variable approach may not be the
> right one.
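The core of the map-join the issue describes is building an in-memory hash table from the small table and probing it with rows of the large table. A minimal sketch of that idea in plain Java follows — the class and method names (`MapJoinSketch`, `buildHashTable`, `mapJoin`) are illustrative assumptions, not Hive's actual classes, and the dump-to-DFS and serialization steps are omitted:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the map-join idea: the small table becomes a hash
// table keyed on the join column; each big-table row probes it. In the
// approach discussed above, this hash table is what gets dumped/broadcast.
public class MapJoinSketch {

    // Build a hash table from the small table: join key -> row payload.
    static Map<String, String> buildHashTable(String[][] smallTable) {
        Map<String, String> ht = new HashMap<>();
        for (String[] row : smallTable) {
            ht.put(row[0], row[1]); // row[0] = join key, row[1] = value
        }
        return ht;
    }

    // Probe: stream the big table's rows against the hash table (map side).
    static List<String> mapJoin(String[][] bigTable, Map<String, String> ht) {
        List<String> out = new ArrayList<>();
        for (String[] row : bigTable) {
            String match = ht.get(row[0]);
            if (match != null) { // inner-join semantics
                out.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] small = {{"1", "us"}, {"2", "uk"}};
        String[][] big = {{"1", "alice"}, {"2", "bob"}, {"3", "carol"}};
        System.out.println(mapJoin(big, buildHashTable(small)));
        // prints [1,alice,us, 2,bob,uk]
    }
}
```

Under the rejected broadcast-variable scheme, one such hash table would be broadcast per small table per bucket (hence the m x n count); the memory-blowup concern is that the hash table deserialized on the driver can be far larger than its Kryo-compressed on-disk form.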

This message was sent by Atlassian JIRA
