hadoop-mapreduce-issues mailing list archives

From "Todd Lipcon (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3895) Speculative execution algorithm in 1.0 is too pessimistic in many cases
Date Wed, 22 Feb 2012 20:05:53 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13213947#comment-13213947 ]

Todd Lipcon commented on MAPREDUCE-3895:

Could the JT simply maintain a "smoothed" indicator of task completion percentage for each
task? So the "jump" from 33% to 66% would get tempered out over a minute or so?
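
One way to picture the "smoothed" indicator suggested above is a time-based exponential moving average over each task's reported progress. This is only a sketch under stated assumptions: the class `SmoothedProgress`, its `update` method, and the 60-second time constant are hypothetical and are not part of the JobTracker or any Hadoop API.

```java
/**
 * Hypothetical helper (not Hadoop code): exponentially smooths the raw
 * progress reported by a task so that sudden jumps are spread over time.
 */
final class SmoothedProgress {
    // Assumed smoothing window in milliseconds, e.g. 60_000 for "a minute or so".
    private final double timeConstantMs;
    private double smoothed;
    private long lastUpdateMs;
    private boolean initialized;

    SmoothedProgress(double timeConstantMs) {
        this.timeConstantMs = timeConstantMs;
    }

    /** Feeds a raw progress report (0.0 to 1.0) taken at nowMs and returns the smoothed value. */
    double update(double rawProgress, long nowMs) {
        if (!initialized) {
            // The first report is taken as-is.
            smoothed = rawProgress;
            lastUpdateMs = nowMs;
            initialized = true;
            return smoothed;
        }
        long elapsedMs = Math.max(0L, nowMs - lastUpdateMs);
        // The weight of the new report grows with the time since the last one.
        double alpha = 1.0 - Math.exp(-elapsedMs / timeConstantMs);
        smoothed += alpha * (rawProgress - smoothed);
        lastUpdateMs = nowMs;
        return smoothed;
    }

    double get() {
        return smoothed;
    }
}
```

With a 60-second time constant, a task whose raw progress jumps from 0.33 to 0.66 would see its smoothed value move only slightly one second later, and roughly two thirds of the way toward 0.66 after a further minute.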

If we have a way of solving this with better heuristics, it seems preferable to adding more
> Speculative execution algorithm in 1.0 is too pessimistic in many cases
> -----------------------------------------------------------------------
>                 Key: MAPREDUCE-3895
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3895
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobtracker, performance
>    Affects Versions: 1.0.0
>            Reporter: Nathan Roberts
> We are seeing many instances where largish jobs end up with 30-50% of their reduce tasks
> being speculatively re-executed. This can be a significant drain on cluster resources.
> The primary reason is the way progress in the reduce phase can make huge jumps
> in a very short amount of time. This leads the speculative execution code to think lots
> of tasks have fallen way behind the average when in fact they haven't.
> The important piece of the algorithm is essentially:
> * Am I more than 20% behind the average progress?
> * Have I been running for at least a minute?
> * Have any tasks completed yet?
> Unfortunately, a set of reduce tasks which spend a couple of minutes in the Copy phase,
> and very little time in the Sort phase, will trigger all these conditions for a large
> percentage of the reduce tasks (the tasks' progress jumps from 33% to 66% almost instantly,
> which then triggers the speculation). I've seen this on several very large jobs which spend
> about 2 minutes in Copy, a few seconds in Sort, and 40 minutes in Reduce. These jobs launch
> about 30-40% additional reduce tasks, which then run for almost the full 40 minutes.
> This area becomes more pluggable in MRv2, but for 1.0 it would be good if some portion
> of this algorithm could be configurable so that a job could have some degree of control
> (just disabling speculative execution is not really an option).
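
The three checks in the description can be sketched as a single predicate with the thresholds passed in, which is roughly what a configurable version might look like. This is an illustration only, not the actual JobTracker code: the class `SpeculationCheck`, its fields, and its method are hypothetical, and "20% behind" is assumed here to mean an absolute gap of 0.20 in progress score.

```java
/**
 * Hypothetical sketch (not Hadoop code) of the three conditions quoted above,
 * with the gap and minimum runtime made configurable.
 */
final class SpeculationCheck {
    private final double progressGap;  // 0.20 in the description (assumed absolute gap)
    private final long minRuntimeMs;   // 60_000 in the description ("at least a minute")

    SpeculationCheck(double progressGap, long minRuntimeMs) {
        this.progressGap = progressGap;
        this.minRuntimeMs = minRuntimeMs;
    }

    boolean shouldSpeculate(double taskProgress, double averageProgress,
                            long runtimeMs, int completedTasks) {
        // Am I more than the configured gap behind the average progress?
        boolean behind = taskProgress < averageProgress - progressGap;
        // Have I been running for at least the minimum time?
        boolean ranLongEnough = runtimeMs >= minRuntimeMs;
        // Have any tasks completed yet?
        boolean someCompleted = completedTasks > 0;
        return behind && ranLongEnough && someCompleted;
    }
}
```

For instance, a reduce task still at 0.33 after 130 seconds, while the average has moved to 0.60 and one task has completed, satisfies all three checks with the thresholds from the description (`new SpeculationCheck(0.20, 60_000L)`), but not with a wider gap of 0.35.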


