hadoop-mapreduce-issues mailing list archives

From "Todd Lipcon (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-3278) 0.20: avoid a busy-loop in ReduceTask scheduling
Date Thu, 27 Oct 2011 04:53:32 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Todd Lipcon updated MAPREDUCE-3278:

    Attachment: reducer-cpu-usage.png

Here's a before/after comparison of a node running terasort. On the left (unpatched), you
can see the point where the reducers start and eat up a ton of CPU. On the right (patched),
the reducers add more iowait but CPU usage is minimal. top showed the reducers in the fetch
stage using ~15% CPU instead of ~105% CPU. Total terasort time improved by roughly 10%. I'll
upload a patch after a bit more testing.

> 0.20: avoid a busy-loop in ReduceTask scheduling
> ------------------------------------------------
>                 Key: MAPREDUCE-3278
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3278
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv1, performance, task
>    Affects Versions:
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: reducer-cpu-usage.png
> Looking at profiling results, it became clear that the ReduceTask has the following busy-loop,
> which was causing it to use 100% of a CPU in the fetch phase in some configurations:
> - the number of reduce fetcher threads is configured to be greater than the number of hosts
> - therefore "busyEnough()" never returns true
> - the "scheduling" portion of the code can't schedule any new fetches, since all of the
>   pending fetches in the mapLocations buffer correspond to hosts that are already being
>   fetched (the hosts are in the {{uniqueHosts}} map)
> - {{getCopyResult()}} immediately returns null, since there are no completed maps.
> Hence ReduceTask spins back and forth between trying to schedule things (and failing)
> and trying to grab completed results (of which there are none), with no waits.
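
To make the spin described above concrete, here is a minimal Java sketch. It is not the
ReduceTask code and not the patch: the class FetchLoopSketch, its method names, and the use
of a BlockingQueue as a stand-in for the copy results are all assumptions made for
illustration. The first loop polls without waiting, as in the reported behavior; the second
shows one possible way to avoid the spin, by blocking briefly when no result is available.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the fetch loop; none of these names come from Hadoop.
class FetchLoopSketch {

    // Busy variant: nothing can be scheduled and poll() returns null at once,
    // so the loop spins until a result appears. Returns the number of passes.
    static long spinUntilResult(BlockingQueue<String> results) {
        long passes = 0;
        while (true) {
            passes++;
            // The scheduling step would go here; in the reported case it schedules nothing.
            if (results.poll() != null) {
                return passes;
            }
        }
    }

    // Waiting variant: block for up to waitMs per pass instead of returning immediately.
    static long waitUntilResult(BlockingQueue<String> results, long waitMs) {
        long passes = 0;
        while (true) {
            passes++;
            try {
                if (results.poll(waitMs, TimeUnit.MILLISECONDS) != null) {
                    return passes;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop
                return passes;
            }
        }
    }

    // Simulates a fetcher thread that completes one copy after delayMs.
    static void deliverLater(BlockingQueue<String> results, long delayMs) {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(delayMs);
                results.offer("map-output");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.setDaemon(true);
        t.start();
    }

    public static void main(String[] args) {
        BlockingQueue<String> results = new LinkedBlockingQueue<>();
        deliverLater(results, 200);
        long spinning = spinUntilResult(results);
        deliverLater(results, 200);
        long waiting = waitUntilResult(results, 50);
        System.out.println("busy passes: " + spinning + ", waiting passes: " + waiting);
    }
}
```

The busy variant typically makes a very large number of passes while waiting for a single
result, while the waiting variant makes only a handful; exact counts depend on the machine.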

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

