hadoop-pig-dev mailing list archives

From "Shravan Matthur Narayanamurthy (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (PIG-162) Rework mapreduce submission and monitoring
Date Mon, 07 Apr 2008 19:27:26 GMT

    [ https://issues.apache.org/jira/browse/PIG-162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586502#action_12586502
] 

shravanmn edited comment on PIG-162 at 4/7/08 12:27 PM:
-----------------------------------------------------------------------------

After some thought, here is what I think. The solution that works for all cases would be to
store the output of the split right away into separate HDFS files and add suitable loads
at the other end. However, if the split occurs in the map phase, the store and load would
probably be more expensive than creating multiple pipelines, each a replica of the map-phase
pipeline below the split, and attaching each replica to the other end with the appropriate
filter. The same applies if there is a diamond structure. In fact, if the split occurs
in the reduce phase, we only need to replicate the pipeline below the split up to the
Map-Reduce boundary. But this solution would not do well in scenario one pointed out by
Alan, because there is no pipeline on the other end for r1 when store q1 is executed. In
the worst case, the last job in the job DAG that contains the split might run again.
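
To make the trade-off concrete, here is a minimal in-memory sketch of the two strategies. It is not Pig code: `CountingSource`, `upstream`, `store_and_load` and `replicate` are hypothetical names, plain Python lists stand in for HDFS files, and a doubling step stands in for the pipeline attached to the split. The scan counter shows that replication re-runs that pipeline once per branch, while store and load runs it once but materializes every branch.

```python
from typing import Callable, Iterable, Iterator, List

Predicate = Callable[[int], bool]


class CountingSource:
    """Toy input that counts how many times it is scanned (hypothetical)."""

    def __init__(self, data: Iterable[int]) -> None:
        self.data = list(data)
        self.scans = 0

    def scan(self) -> Iterator[int]:
        self.scans += 1
        return iter(self.data)


def upstream(records: Iterable[int]) -> Iterator[int]:
    """Stand-in for the pipeline attached to the split (assumption)."""
    return (r * 2 for r in records)


def store_and_load(source: CountingSource, predicates: List[Predicate]) -> List[List[int]]:
    """Run the pipeline once and materialize one output per split branch.

    Each inner list plays the role of an HDFS file that a later load reads.
    """
    stored: List[List[int]] = [[] for _ in predicates]
    for rec in upstream(source.scan()):
        for out, pred in zip(stored, predicates):
            if pred(rec):
                out.append(rec)
    return stored


def replicate(source: CountingSource, predicates: List[Predicate]) -> List[List[int]]:
    """Build one replica of the pipeline per branch, with that branch's filter attached."""
    return [[rec for rec in upstream(source.scan()) if pred(rec)] for pred in predicates]


if __name__ == "__main__":
    branches = [lambda r: r < 6, lambda r: r >= 6]
    a, b = CountingSource(range(6)), CountingSource(range(6))
    print(store_and_load(a, branches), "scans:", a.scans)
    print(replicate(b, branches), "scans:", b.scans)
```

Both functions return the same per-branch records. The difference is only in where the cost lands: extra writes and reads for the stored outputs, or repeated execution of the replicated pipeline.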

Also, the solution where we check for diamond structures and differentiate between map and
reduce occurrences will take time to implement, as the Physical to MR translation code I
wrote a few days back did not consider this kind of optimization. It would probably take a
couple of days to modify it.
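
As a rough illustration of the diamond check mentioned above, the sketch below treats a plan as a map from each operator to its successors and calls a split a diamond when two of its branches reach a common operator. This definition, the `Plan` type and both function names are assumptions made for the example; they are not taken from the Pig code base.

```python
from typing import Dict, List, Set

# Hypothetical plan representation: operator name -> list of successor names.
Plan = Dict[str, List[str]]


def descendants(plan: Plan, start: str) -> Set[str]:
    """Return every operator reachable from start, including start itself."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(plan.get(node, []))
    return seen


def is_diamond_split(plan: Plan, node: str) -> bool:
    """True if at least two branches leaving node rejoin further along the plan."""
    reach = [descendants(plan, branch) for branch in plan.get(node, [])]
    for i in range(len(reach)):
        for j in range(i + 1, len(reach)):
            if reach[i] & reach[j]:
                return True
    return False


if __name__ == "__main__":
    diamond = {
        "load": ["split"],
        "split": ["f1", "f2"],
        "f1": ["join"],
        "f2": ["join"],
        "join": ["store"],
    }
    print(is_diamond_split(diamond, "split"))
```

A translator could run a check like this per split and then pick a strategy depending on whether the split sits in the map or the reduce phase; how that phase is determined is left out here.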

I have attached a [figure|https://issues.apache.org/jira/secure/attachment/12379589/split.png]
that shows the replication idea for a diamond structure with the split occurring in the
reduce phase.

So please suggest whether it is worth modifying the translation now, or whether we should
implement the store solution and push this optimization either to the optimization layer or
to a later point in time. Note that the current Pig trunk code also uses the store and load
approach.

  
> Rework mapreduce submission and monitoring
> ------------------------------------------
>
>                 Key: PIG-162
>                 URL: https://issues.apache.org/jira/browse/PIG-162
>             Project: Pig
>          Issue Type: Sub-task
>         Environment: This bug tracks work to rework the submission and monitoring interface
to map reduce as described in http://wiki.apache.org/pig/PigTypesFunctionalSpec
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: split.png
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

