hadoop-pig-dev mailing list archives

From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1218) Use distributed cache to store samples
Date Wed, 17 Feb 2010 19:48:27 GMT

    [ https://issues.apache.org/jira/browse/PIG-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834960#action_12834960 ]

Ashutosh Chauhan commented on PIG-1218:

On trunk - patch
In POFRJoin#setUpHashMap():
 POLoad ld = new POLoad(new OperatorKey("Repl File Loader", 1L),
                     replFile, false);
Should this instead be:
 POLoad ld = new POLoad(new OperatorKey("Repl File Loader",
                     NodeIdGenerator.getGenerator().getNextNodeId("Repl File Loader")),
                     replFile, false);
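
To illustrate the concern behind this question, here is a minimal, self-contained sketch. It assumes that an operator key is compared by its scope string plus numeric id, so two keys built with the same literal id are indistinguishable. The classes Key and IdGenerator below are hypothetical stand-ins written for this example; they are not Pig's OperatorKey or NodeIdGenerator.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class OperatorKeySketch {
    // Hypothetical stand-in for an operator key: equality is scope + id.
    static final class Key {
        final String scope;
        final long id;

        Key(String scope, long id) {
            this.scope = scope;
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key k = (Key) o;
            return id == k.id && scope.equals(k.scope);
        }

        @Override
        public int hashCode() {
            return Objects.hash(scope, id);
        }
    }

    // Hypothetical per-scope counter: each call returns a fresh id for that scope.
    static final class IdGenerator {
        private final Map<String, Long> next = new HashMap<>();

        synchronized long getNextId(String scope) {
            long id = next.getOrDefault(scope, 0L);
            next.put(scope, id + 1);
            return id;
        }
    }

    public static void main(String[] args) {
        // Hard-coded id: both keys compare equal, so they collide in any map or plan.
        Key a = new Key("Repl File Loader", 1L);
        Key b = new Key("Repl File Loader", 1L);
        System.out.println(a.equals(b)); // prints "true"

        // Generated ids: each key is distinct within the scope.
        IdGenerator gen = new IdGenerator();
        Key c = new Key("Repl File Loader", gen.getNextId("Repl File Loader"));
        Key d = new Key("Repl File Loader", gen.getNextId("Repl File Loader"));
        System.out.println(c.equals(d)); // prints "false"
    }
}
```

Whether a collision actually matters here depends on how the loader's key is used in the surrounding plan, which this sketch does not model.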

Also, the following can be moved out of the for loop to avoid calling connect() on pc multiple times:
 PigContext pc = new PigContext(ExecType.MAPREDUCE, props);
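
The suggestion is ordinary loop hoisting. As a rough sketch of the difference, the example below uses a hypothetical Context class that only counts connect() calls; it stands in for an expensive-to-connect object and is not Pig's PigContext. The file names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class HoistContextSketch {
    // Counts how many times any Context has connected.
    static int connectCalls = 0;

    // Hypothetical stand-in for a context whose connect() is expensive.
    static final class Context {
        void connect() {
            connectCalls++;
        }

        // Placeholder for per-file work done through the context.
        int load(String file) {
            return file.length();
        }
    }

    // Creates and connects a context on every iteration.
    static int connectsPerIteration(List<String> files) {
        int before = connectCalls;
        for (String file : files) {
            Context ctx = new Context();
            ctx.connect();
            ctx.load(file);
        }
        return connectCalls - before;
    }

    // Creates and connects the context once, then reuses it in the loop.
    static int connectsHoisted(List<String> files) {
        int before = connectCalls;
        Context ctx = new Context();
        ctx.connect();
        for (String file : files) {
            ctx.load(file);
        }
        return connectCalls - before;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("repl1", "repl2", "repl3");
        System.out.println(connectsPerIteration(files)); // prints "3"
        System.out.println(connectsHoisted(files));      // prints "1"
    }
}
```

Hoisting is only safe if the context does not depend on the loop variable and can be reused across iterations, which is an assumption of this sketch.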

In JobControlCompiler#setupDistributedCacheForFRJoin():
new FRJoinDistributedCacheVisitor(mro.reducePlan, pigContext, conf)
Do we need this? Isn't FR join a map-side join? If so, a POFRJoin ending up in mro.reducePlan
would be a bug, no?

> Use distributed cache to store samples
> --------------------------------------
>                 Key: PIG-1218
>                 URL: https://issues.apache.org/jira/browse/PIG-1218
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>         Attachments: PIG-1218.patch, PIG-1218_2.patch
> Currently, in the case of skew join and order by, we use a sample that is just written to
> the DFS (not the distributed cache) and, as a result, gets opened and copied around more than
> necessary. This impacts query performance and also places unnecessary load on the name node.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
