hadoop-pig-dev mailing list archives

From "Thejas M Nair (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1458) aggregate files for replicated join
Date Sat, 28 Aug 2010 21:52:54 GMT

    [ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903895#action_12903895 ]

Thejas M Nair commented on PIG-1458:
------------------------------------

Another comment about the patch -
- The test testUnknownNumMaps2 is the same as testUnknownNumMaps; it should be removed.


A note about the 2nd case described in the first comment -
bq. 2.  The right input is a map-only job and input files do not exist at the compile time.

When the input of that map-only job does not exist at compile time, in most (all?) cases it
would be possible to determine the number of files by looking at the previous MR operator
(or the ones before that).
Also, with the current implementation, since the checks on the number of files are done before
the MR jobs are merged together, there will be cases where the final plan has only one MR
job, with existing input for the replicated input, and Pig still treats it as case 2.
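
To illustrate the idea of looking at the previous MR operators, here is a minimal sketch. It is not code from the patch and does not use Pig's real classes: MrOpSketch and FileCountSketch are hypothetical stand-ins. It also assumes that a map-only job writes one output file per map task and runs roughly one map per input file, so the file count carries through a chain of map-only jobs. Large files that span several splits are ignored.

```java
// Hypothetical stand-in for a compiled MR operator; NOT Pig's MapReduceOper API.
final class MrOpSketch {
    final boolean mapOnly;         // true if the job has no reduce phase
    final int existingInputFiles;  // files found at compile time, or -1 if the input does not exist yet
    final MrOpSketch predecessor;  // the job that produces this job's input, or null

    MrOpSketch(boolean mapOnly, int existingInputFiles, MrOpSketch predecessor) {
        this.mapOnly = mapOnly;
        this.existingInputFiles = existingInputFiles;
        this.predecessor = predecessor;
    }
}

final class FileCountSketch {
    /**
     * Estimates how many files the given job reads, or returns -1 if unknown.
     * Assumption: a map-only predecessor emits about as many files as it reads.
     */
    static int estimateInputFiles(MrOpSketch op) {
        MrOpSketch cur = op;
        while (cur != null) {
            if (cur.existingInputFiles >= 0) {
                return cur.existingInputFiles;  // input exists: count is known
            }
            MrOpSketch pred = cur.predecessor;
            if (pred == null || !pred.mapOnly) {
                return -1;  // a reduce phase changes the file count; not modelled here
            }
            cur = pred;  // input is produced by a map-only job: look one step back
        }
        return -1;
    }
}
```

With a chain such as load (7 existing files) -> map-only filter -> join, estimateInputFiles on the join job returns 7 under these assumptions. A predecessor with a reduce phase yields -1 here, although in practice its output count would follow from the number of reducers.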

The example used in testUnknownNumMaps() has only one input MR job, with inputs that exist
at compile time, but if pig.frjoin.merge.files.optimistic=false, Pig will create an additional
MR job that combines the input -
{code}
A = LOAD '" + INPUT_FILE + "' as (x:int,y:int);
B = Filter A by x < 50;
C = join A by $0, B by $0 using 'repl';
{code}


> aggregate files for replicated join
> -----------------------------------
>
>                 Key: PIG-1458
>                 URL: https://issues.apache.org/jira/browse/PIG-1458
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Richard Ding
>             Fix For: 0.8.0
>
>         Attachments: PIG-1458.patch
>
>
> We have noticed that if the smaller data in replicated join has many files, this puts
> unneeded burden on the name node. Pre-aggregating the files can improve the situation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

