hadoop-common-dev mailing list archives

From "Amar Kamat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2568) Pin reduces with consecutive IDs to nodes and have a single shuffle task per job per node
Date Mon, 04 Feb 2008 09:17:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565291#action_12565291 ]

Amar Kamat commented on HADOOP-2568:

I guess we are talking of the following things here:
1) _Bundler_ : Bundles map outputs from different maps on a node for a reducer at the destination
2) _Separator_ : Separates a single map's output for multiple reducers on the destination
3) _Bundler_ and _Separator_ : 1 + 2
The _Bundler_ assumes that multiple maps on the source node get completed at the same time,
so that their outputs for one reducer can be bundled and shipped across the network. This
will require a common _BUNDLER_ on the map side. This assumption might not hold, since maps might
finish at different times or might fail and get re-executed.
The _Separator_ assumes that there is a common _SHUFFLER_ at the destination node and that the
reducers stay pinned together every time. But reducers might fail and get re-executed.
Taking care of these dynamics will add a lot of complexity; I am not sure how much this will help.
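To make the dynamic-completion problem concrete, here is a minimal, hypothetical sketch (the class and method names are illustrative, not Hadoop APIs) of a node-local bundler that groups finished map outputs by destination reducer so one transfer can carry all of them. The catch is visible in the code: entries only accumulate as maps finish, so whatever is shipped at a given moment misses maps that complete or get re-executed later.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not part of Hadoop's actual shuffle code.
public class MapOutputBundler {
    // reducerId -> list of map-output paths produced on this node so far
    private final Map<Integer, List<String>> pending = new HashMap<>();

    // Called when a map task on this node finishes its partition for a reducer.
    public void mapOutputReady(int reducerId, String outputPath) {
        pending.computeIfAbsent(reducerId, k -> new ArrayList<>()).add(outputPath);
    }

    // Ship everything currently buffered for one reducer as a single bundle.
    // Outputs of maps that finish (or are re-executed) after this call are
    // missed -- exactly the dynamic-completion problem described above.
    public List<String> shipBundle(int reducerId) {
        List<String> bundle = pending.remove(reducerId);
        return bundle == null ? new ArrayList<>() : bundle;
    }
}
```

A bundle shipped while some maps are still running is incomplete, so the sender either waits (delaying the shuffle) or ships multiple partial bundles (losing the seek savings).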

> Pin reduces with consecutive IDs to nodes and have a single shuffle task per job per node
> -----------------------------------------------------------------------------------------
>                 Key: HADOOP-2568
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2568
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.17.0
> The idea is to reduce disk seeks while fetching the map outputs. If we opportunistically
> pin reduces with consecutive IDs (like 5, 6, 7 .. max-reduce-tasks on that node) on a node,
> and have a single shuffle task, we should benefit, if for every fetch, that shuffle task fetches
> all the outputs for the reduces it is shuffling for. In the case where we have 2 reduces per
> node, we will decrease the #seeks in the map output files on the map nodes by 50%. Memory
> usage by that shuffle task would be proportional to the number of reduces it is shuffling
> for (to account for the number of ramfs instances, one per reduce). But overall it should
> Thoughts?
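The 50% figure above follows from a simple count: without a shared shuffle task, each of the R reducers on a node issues its own fetch against every map output file (one seek each); with a single shuffle task reading all R partitions in one pass per file, seeks per map file drop from R to 1. A tiny, illustrative calculation (not Hadoop code; the class name is hypothetical):

```java
// Back-of-the-envelope check of the 50% claim, illustrative only.
public class SeekEstimate {
    // One seek per (map output file, fetch) pair: R fetches per file
    // unbundled, a single combined fetch per file when bundled.
    static long seeks(int maps, int reducersPerNode, boolean bundled) {
        return (long) maps * (bundled ? 1 : reducersPerNode);
    }

    public static void main(String[] args) {
        int maps = 100;
        System.out.println(seeks(maps, 2, false)); // 200 seeks, one per reducer per file
        System.out.println(seeks(maps, 2, true));  // 100 seeks: the 50% reduction
    }
}
```

With R = 2 reduces per node the reduction is 50%; more generally it scales as 1/R, at the cost of shuffle-task memory growing with R.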

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
