hive-dev mailing list archives

From "Jimmy Xiang (JIRA)" <>
Subject [jira] [Updated] (HIVE-8621) Dump small table join data for map-join [Spark Branch]
Date Thu, 06 Nov 2014 23:16:34 GMT


Jimmy Xiang updated HIVE-8621:
    Fix Version/s: spark-branch
           Status: Patch Available  (was: Open)

> Dump small table join data for map-join [Spark Branch]
> ------------------------------------------------------
>                 Key: HIVE-8621
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Suhas Satish
>            Assignee: Jimmy Xiang
>             Fix For: spark-branch
>         Attachments: HIVE-8621.1-spark.patch
> This JIRA aims to reuse a slightly modified version of the MapReduce distributed cache
> approach in Spark, dumping map-joined small tables as hash tables onto the Spark
> cluster's DFS.
> This is a sub-task of map-join for Spark.
> It can use the baseline patch for map-join.
> The original idea was to use Spark's broadcast variables for the small tables.
> The number of broadcast variables that must be created is m x n, where
> 'm' is the number of small tables in the (m+1)-way join and 'n' is the number of buckets
> of the tables. If the tables are unbucketed, n=1.
> However, it was discovered that objects compressed with Kryo serialization on disk can
> occupy 20X or more space when deserialized in memory. For a bucket join, the Spark driver
> has to hold all the buckets (for bucketed tables) in memory (to provide fault tolerance
> against executor failures), although the executors only need individual buckets in their
> memory. So the broadcast variable approach may not be the right one.
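
As a rough illustration of the sizing concern in the description above, here is a minimal Python sketch of the m x n arithmetic. The function names, the per-bucket serialized size, and the example table counts are hypothetical and not taken from the patch. The 20X expansion factor is the figure quoted in the description ("20X or more"), so the result is a lower-bound estimate under those assumptions, not a measurement.

```python
def broadcast_variable_count(num_small_tables, num_buckets=1):
    """Number of broadcast variables: m x n (n = 1 when unbucketed)."""
    if num_small_tables < 1 or num_buckets < 1:
        raise ValueError("table and bucket counts must be at least 1")
    return num_small_tables * num_buckets


def estimated_driver_memory_bytes(serialized_bytes_per_bucket,
                                  num_small_tables,
                                  num_buckets=1,
                                  expansion_factor=20):
    """Estimate driver memory if every bucket is held deserialized.

    Assumes every bucket has the same serialized size (a simplification)
    and that deserialization inflates it by expansion_factor.
    """
    count = broadcast_variable_count(num_small_tables, num_buckets)
    return count * serialized_bytes_per_bucket * expansion_factor


if __name__ == "__main__":
    MB = 1024 ** 2
    # Hypothetical case: 3 small tables (a 4-way join), 8 buckets each,
    # 10 MB serialized per bucket.
    total = estimated_driver_memory_bytes(10 * MB, 3, 8)
    print(broadcast_variable_count(3, 8), "broadcast variables")
    print(total // MB, "MB estimated on the driver")
```

With these made-up inputs, modest on-disk sizes already add up to several gigabytes on the driver, which is the reason the description gives for doubting the broadcast variable approach.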

This message was sent by Atlassian JIRA
