hadoop-mapreduce-issues mailing list archives

From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-318) Refactor reduce shuffle code
Date Tue, 22 Sep 2009 19:06:16 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758371#action_12758371 ]

Scott Carey commented on MAPREDUCE-318:

In addition to a quick code review of the parts I was interested in (fetching map output
fragments), I ran a quick-and-dirty test on trunk on a tiny cluster to confirm that this
change has the same effect as the one-line fix I apply to 0.19.2 in production for similar
benefits. See my comment from June 10, 2009. The old code was artificially throttling the
shuffle to one output file per TT per ping-cycle.

Quite simply, any fix that lets a reducer fetch all the complete map outputs it finds in one
ping-cycle helps jobs whose map output count is much greater than the node count, whether
that fix is a one-line hack or a full refactor.

The impact really depends on the cluster config and job type. Ours is new hardware with
plenty of RAM per node, which leads to running ~11+ concurrent map tasks per node and a
larger ratio of map shards per reduce to task trackers. The bigger that ratio, the bigger
the impact of optimized shuffle fetching.
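
To make the ratio argument concrete, here is a rough back-of-the-envelope model in Java. It is
not the ReduceTask code and not taken from the patch: the class and method names are made up
for illustration, and it assumes every map output is already complete when the reducer starts
fetching. Under the old behaviour (one output file per TT per ping-cycle) the number of cycles
is bounded below by the largest number of outputs sitting on any one TT; when a reducer may
fetch everything it finds, a single cycle is enough in this simplified model.

```java
// Hypothetical sketch, not Hadoop code: counts ping-cycles needed to fetch
// map outputs that are all already complete.
class ShuffleCycleSketch {

    // Old behaviour: at most one output per TaskTracker per ping-cycle,
    // so the busiest tracker determines how many cycles are needed.
    public static int cyclesThrottled(int[] readyOutputsPerTracker) {
        int max = 0;
        for (int n : readyOutputsPerTracker) {
            max = Math.max(max, n);
        }
        return max;
    }

    // Fixed behaviour: every complete output found in a cycle can be fetched,
    // so one cycle suffices as soon as anything is ready.
    public static int cyclesUnthrottled(int[] readyOutputsPerTracker) {
        for (int n : readyOutputsPerTracker) {
            if (n > 0) {
                return 1;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // Assumed example: 4 trackers, 11 complete map outputs on each.
        int[] ready = {11, 11, 11, 11};
        System.out.println("throttled=" + cyclesThrottled(ready)
                + " unthrottled=" + cyclesUnthrottled(ready));
        // prints: throttled=11 unthrottled=1
    }
}
```

In this model the gap between the two counts grows with the number of map outputs per TT,
which is the ratio described above. Real jobs will differ, since maps finish over time and
fetches compete for bandwidth.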

> Refactor reduce shuffle code
> ----------------------------
>                 Key: MAPREDUCE-318
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-318
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.21.0
>         Attachments: HADOOP-5233_api.patch, HADOOP-5233_part0.patch, mapred-318-14Aug.patch,
> mapred-318-20Aug.patch, mapred-318-24Aug.patch, mapred-318-3Sep-v1.patch, mapred-318-3Sep.patch,
>
> The reduce shuffle code has become very complex and entangled. I think we should move
> it out of ReduceTask and into a separate package (org.apache.hadoop.mapred.task.reduce). Details
> to follow.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
