hadoop-common-dev mailing list archives

From "Runping Qi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2568) Pin reduces with consecutive IDs to nodes and have a single shuffle task per job per node
Date Sun, 03 Feb 2008 02:32:08 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565136#action_12565136 ]

Runping Qi commented on HADOOP-2568:
------------------------------------


Fetching all the available segments (produced by multiple mappers) for the same reducer from
a node is a good idea, and should be easy to implement. It will definitely improve shuffle
efficiency by reducing the number of round-trips and by increasing the payload size per fetch.
This will be especially significant for large jobs with a large number of mappers per node.
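For what it's worth, below is a rough sketch of what a batched fetch could look like on the
reduce side. The comma-separated "map=" parameter and the class name are hypothetical, not
the current TaskTracker servlet interface (which serves one map output per request):

    // Hypothetical batched fetch: one HTTP round-trip pulls every available
    // map output segment for this reducer from a single map node.
    import java.io.InputStream;
    import java.net.URL;
    import java.util.List;

    public class BatchedFetcher {
      // mapIds: completed maps on this host whose output this reduce still needs.
      // The comma-separated "map=" list is an assumed extension of the servlet
      // interface; the stock one takes a single map id per request.
      InputStream fetchAll(String host, int port, String jobId,
                           List<String> mapIds, int reduceId) throws Exception {
        StringBuilder maps = new StringBuilder();
        for (String id : mapIds) {
          if (maps.length() > 0) maps.append(',');
          maps.append(id);
        }
        URL url = new URL("http://" + host + ":" + port
            + "/mapOutput?job=" + jobId
            + "&map=" + maps
            + "&reduce=" + reduceId);
        // One connection and one larger payload, instead of mapIds.size()
        // separate round-trips.
        return url.openConnection().getInputStream();
      }
    }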

Introducing a separate shuffling phase needs more study. It would complicate the framework
significantly, and depending on the actual implementation, the net benefit is not obvious.



> Pin reduces with consecutive IDs to nodes and have a single shuffle task per job per node
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-2568
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2568
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.17.0
>
>
> The idea is to reduce disk seeks while fetching the map outputs. If we opportunistically
> pin reduces with consecutive IDs (like 5, 6, 7 .. max-reduce-tasks on that node) on a node,
> and have a single shuffle task, we should benefit, if for every fetch, that shuffle task
> fetches all the outputs for the reduces it is shuffling for. In the case where we have 2
> reduces per node, we will decrease the #seeks in the map output files on the map nodes
> by 50%. Memory usage by that shuffle task would be proportional to the number of reduces
> it is shuffling for (to account for the number of ramfs instances, one per reduce). But
> overall it should help.
> Thoughts?
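To make the 50% figure concrete, here is a back-of-the-envelope count. It assumes one disk
seek per partition read from each map output file, and that a single shuffle task can read
consecutive partitions of one file in a single pass; the names are illustrative only:

    // Rough seek counts for the proposal quoted above.
    public class SeekEstimate {
      // Without a shared shuffle task, each reduce fetches its partition
      // from each map output file separately: one seek per (file, reduce).
      static long seeksPerFetchRound(long mapsPerNode, long reducesPerNode) {
        return mapsPerNode * reducesPerNode;
      }
      // One shuffle task reads the consecutive partitions of each map
      // output file in one pass: one seek per file.
      static long seeksWithShuffleTask(long mapsPerNode) {
        return mapsPerNode;
      }
      public static void main(String[] args) {
        System.out.println(seeksPerFetchRound(100, 2));  // 200 seeks
        System.out.println(seeksWithShuffleTask(100));   // 100 seeks: 50% fewer
      }
    }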

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

