pig-dev mailing list archives

From "Xianda Ke (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-4771) Implement FR Join for spark engine
Date Mon, 28 Mar 2016 14:02:25 GMT

    [ https://issues.apache.org/jira/browse/PIG-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214220#comment-15214220 ]

Xianda Ke commented on PIG-4771:
--------------------------------

Hi [~kellyzly], FR Join is similar to Broadcast Hash Join (Spark SQL), which broadcasts the
small RDD. Broadcasting performs better than DistributedCache or a shared file on HDFS (see
[paper | http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/mosharaf-spark-bc-report-spring10.pdf]).
We can investigate this approach and give it a try.
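
To make the comparison concrete, here is a minimal sketch of the build-and-probe idea behind a broadcast hash join, written in plain Python rather than against the Spark or Pig APIs. The function name `replicated_join` and the sample rows are hypothetical and used only for illustration; in a real engine the hash table built from the small relation would be shipped to every task, and each task would probe it with its own fragment of the large relation.

```python
from collections import defaultdict


def replicated_join(big, small, big_key, small_key):
    """Inner-join `big` against `small` by holding `small` in memory.

    `big` and `small` are iterables of tuples; `big_key` and `small_key`
    are the positions of the join columns. Hypothetical helper, not a
    Pig or Spark API.
    """
    # Build phase: hash the small relation once. This is the part that
    # would be broadcast (replicated) to every task.
    table = defaultdict(list)
    for row in small:
        table[row[small_key]].append(row)

    # Probe phase: stream the large relation and look up matches.
    # No shuffle of the large relation is needed.
    for row in big:
        for match in table.get(row[big_key], ()):
            yield row + match


# Illustrative data only.
big = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]
tiny = [(2, "x"), (3, "y"), (4, "z")]

joined = list(replicated_join(big, tiny, 0, 0))
for row in joined:
    print(row)
```

The sketch also shows why the small relation has to fit in memory: the whole hash table is materialized before the first row of the large relation is processed.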

> Implement FR Join for spark engine
> ----------------------------------
>
>                 Key: PIG-4771
>                 URL: https://issues.apache.org/jira/browse/PIG-4771
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4771.patch
>
>
> We currently use a regular join in place of FR join in the code base (fd31fda). We need to
> implement FR join.
> Some info collected from https://pig.apache.org/docs/r0.11.0/perf.html#replicated-joins:
> *Replicated Joins*
> Fragment replicate join is a special type of join that works well if one or more relations
> are small enough to fit into main memory. In such cases, Pig can perform a very efficient
> join because all of the Hadoop work is done on the map side. In this type of join the large
> relation is followed by one or more small relations. The small relations must be small enough
> to fit into main memory; if they don't, the process fails and an error is generated.
> *Usage*
> Perform a replicated join with the USING clause (see JOIN (inner) and JOIN (outer)).
> In this example, a large relation is joined with two smaller relations. Note that the large
> relation comes first, followed by the smaller relations, and all small relations together
> must fit into main memory; otherwise an error is generated.
> big = LOAD 'big_data' AS (b1,b2,b3);
> tiny = LOAD 'tiny_data' AS (t1,t2,t3);
> mini = LOAD 'mini_data' AS (m1,m2,m3);
> C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
> *Conditions*
> Fragment replicate joins are experimental; we don't have a strong sense of how small
> the small relation must be to fit into memory. In our tests with a simple query that involves
> just a JOIN, a relation of up to 100 MB can be used if the process overall gets 1 GB of memory.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
