pig-dev mailing list archives

From "Shravan Matthur Narayanamurthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-554) Fragment Replicate Join
Date Tue, 23 Dec 2008 12:35:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658834#action_12658834 ]

Shravan Matthur Narayanamurthy commented on PIG-554:

1) Consider the following script:
A = load 'file1';
B = load 'file2';
C = filter A by $0>10;
D = filter B by $0<10;
E = join C by $0, D by $0 using replicated;

We need to materialize the result of D before we can use it as the replicated input. Also, DC
(the DistributedCache) has not been used because, iirc, it doesn't support directories (we would
have to handle many complications manually), and the load specification in Pig can contain regexps
too. And since the replicated file is small, it doesn't make much difference.

2) Instead of writing all the code to handle the various combinations of the group item specification,
I chose to use LR (local rearrange), which already does it. I think I store only the plain tuple
(extracted from the LR output), not the LR output itself, in the hashtables, so it doesn't add any
memory overhead. The LR is used only to separate out key & value, and these are stored as a mapping
from key to value (plain tuples).
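
To make the key-to-value mapping concrete, here is a minimal sketch in Python (not the patch's Java code). The function names and the list-per-key layout are assumptions made for illustration; split_key_value is a hypothetical stand-in for the key/value separation that LR performs.

```python
from collections import defaultdict


def split_key_value(tup, key_cols):
    """Hypothetical stand-in for the LR step: return (join key, plain tuple)."""
    key = tuple(tup[i] for i in key_cols)
    return key, tup


def build_replicated_table(small_input, key_cols):
    """Load the small (replicated) input into an in-memory hashtable.

    Only plain tuples are stored as values, grouped under their join key.
    """
    table = defaultdict(list)
    for tup in small_input:
        key, value = split_key_value(tup, key_cols)
        table[key].append(value)
    return table


def join_fragment(fragment, key_cols, table):
    """Stream one fragment of the large input and probe the hashtable.

    Inner join only: fragment tuples with no matching key are dropped.
    """
    for tup in fragment:
        key, value = split_key_value(tup, key_cols)
        for match in table.get(key, ()):
            yield value + match


# Usage: the small input is loaded once, then each fragment is joined on $0.
small = [(1, "x"), (2, "y"), (1, "z")]
fragment = [(1, "a"), (3, "b"), (2, "c")]
table = build_replicated_table(small, key_cols=[0])
joined = list(join_fragment(fragment, key_cols=[0], table=table))
```

Because each fragment only reads the table, the same table can serve every fragment of the large input independently.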

> Fragment Replicate Join
> -----------------------
>                 Key: PIG-554
>                 URL: https://issues.apache.org/jira/browse/PIG-554
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: types_branch
>            Reporter: Shravan Matthur Narayanamurthy
>            Assignee: Shravan Matthur Narayanamurthy
>             Fix For: types_branch
>         Attachments: frjofflat.patch, frjofflat1.patch
> Fragment Replicate Join (FRJ) is useful when we want a join between a huge table and a
very small table (small enough to fit in memory) and the join doesn't expand the data by much. The
idea is to distribute the processing of the huge file by fragmenting it and replicating the
small file to all machines receiving a fragment of the huge file. Because the entire small file
is available on each machine, the join becomes a trivial task without needing any break in the
pipeline. Exhaustive tests have been done to determine the improvement we get out of FRJ. Here are
the details: http://wiki.apache.org/pig/PigFRJoin
> The patch makes changes to parts of the code where new operators are introduced. Currently,
when a new operator is introduced, its alias is not set. For schema computation I have modified
this behaviour to set the alias of the new operator to that of its predecessor. The logical
side of the patch mimics the cogroup behavior as join syntax closely resembles that of cogroup.
Currently, this patch doesn't have support for joins other than inner joins. The rest of the
code has been documented.
