hive-dev mailing list archives

From "Hive QA (JIRA)" <>
Subject [jira] [Commented] (HIVE-8621) Dump small table join data for map-join [Spark Branch]
Date Fri, 07 Nov 2014 00:17:34 GMT


Hive QA commented on HIVE-8621:

{color:red}Overall{color}: -1 at least one test failed

Here are the results of testing the latest attachment:

{color:red}ERROR:{color} -1 due to 9 failed/errored test(s), 7123 tests executed
*Failed tests:*

Test results:
Console output:
Test logs:

Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 9 tests failed

This message is automatically generated.

ATTACHMENT ID: 12679991 - PreCommit-HIVE-SPARK-Build

> Dump small table join data for map-join [Spark Branch]
> ------------------------------------------------------
>                 Key: HIVE-8621
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Suhas Satish
>            Assignee: Jimmy Xiang
>             Fix For: spark-branch
>         Attachments: HIVE-8621.1-spark.patch
> This JIRA aims to reuse a slightly modified version of the MapReduce distributed-cache approach in Spark: dump the map-joined small tables as hash tables onto the cluster's DFS.
> This is a sub-task of map-join for Spark.
> This can use the baseline patch for map-join.
> The original idea was to use Spark's broadcast-variable concept for the small tables.
> The number of broadcast variables that must be created is m x n, where 'm' is the number of small tables in the (m+1)-way join and 'n' is the number of buckets per table (n=1 if unbucketed).
> But it was discovered that objects serialized with Kryo on disk can occupy 20X or more space when deserialized in memory. For bucket join, the Spark driver has to hold all the buckets (for bucketed tables) in memory (to provide fault tolerance against executor failures), even though each executor only needs individual buckets in its own memory. So the broadcast-variable approach may not be the right one.
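The m x n count in the quoted description can be illustrated with a minimal arithmetic sketch. This is an editor's illustration, not code from the patch; the class and method names are hypothetical, and the join shapes in `main` are made-up examples.

```java
// Hypothetical sketch of the broadcast-variable count described above:
// m small tables in an (m+1)-way join, each split into n buckets,
// would require one broadcast variable per (table, bucket) pair.
public class BroadcastVarCount {

    // m = number of small tables in the (m+1)-way join
    // n = number of buckets per table (n = 1 if unbucketed)
    static int broadcastVarCount(int m, int n) {
        return m * n;
    }

    public static void main(String[] args) {
        // A 3-way join (two small tables), each bucketed into 4 buckets:
        System.out.println(broadcastVarCount(2, 4)); // prints 8
        // The same join with unbucketed tables (n = 1):
        System.out.println(broadcastVarCount(2, 1)); // prints 2
    }
}
```

Combined with the quoted 20X deserialization blow-up, this count suggests why holding every bucket's hash table on the driver becomes the bottleneck as m and n grow.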

This message was sent by Atlassian JIRA
