hadoop-mapreduce-issues mailing list archives

From "Mahadev konar (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3895) Speculative execution algorithm in 1.0 is too pessimistic in many cases
Date Wed, 22 Feb 2012 18:49:50 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13213842#comment-13213842 ]

Mahadev konar commented on MAPREDUCE-3895:
------------------------------------------

+1 on doing this on 0.23. I agree that this is a major change and that it could destabilize 1.0.

> Speculative execution algorithm in 1.0 is too pessimistic in many cases
> -----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3895
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3895
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobtracker, performance
>    Affects Versions: 1.0.0
>            Reporter: Nathan Roberts
>
> We are seeing many instances where largish jobs are ending up with 30-50% of reduce tasks
> being speculatively re-executed. This can be a significant drain on cluster resources.
> The primary reason is the way progress in the reduce phase can make huge jumps
> in a very short amount of time. This leads the speculative execution code to think lots
> of tasks have fallen way behind the average when in fact they haven't.
> The important piece of the algorithm is essentially:
> * Am I more than 20% behind the average progress?
> * Have I been running for at least a minute?
> * Have any tasks completed yet?
> Unfortunately, a set of reduce tasks which spend a couple of minutes in the Copy phase,
> and very little time in the Sort phase, will trigger all these conditions for a large percentage
> of the reduce tasks. (The tasks' progress jumps from 33% to 66% almost instantly, which then
> triggers the speculation.) I've seen this on several very large jobs which spend about 2 minutes
> in Copy, a few seconds in Sort, and 40 minutes in Reduce. These jobs launch about 30-40% additional
> reduce tasks, which then run for almost the full 40 minutes.
> This area becomes more pluggable in MRv2, but for 1.0 it would be good if some portion
> of this algorithm could be configurable so that a job could have some degree of control (just
> disabling speculative execution is not really an option).
>  
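
For illustration only, the three conditions quoted above can be sketched as a small Python model. This is not the Hadoop 1.0 source: the names `Task`, `should_speculate`, `SPECULATIVE_GAP` and `SPECULATIVE_LAG_SECS` are hypothetical, the third condition is read here as "this task has not completed yet" (an assumption about the bullet), and the 70/30 split of task progress is made-up data chosen to mimic a job where most reducers have just jumped from 33% to 66%.

```python
from dataclasses import dataclass

# Hypothetical constants mirroring the thresholds named in the issue description.
SPECULATIVE_GAP = 0.20       # "more than 20% behind the average progress"
SPECULATIVE_LAG_SECS = 60    # "running for at least a minute"


@dataclass
class Task:
    progress: float           # fraction complete, 0.0 .. 1.0
    running_secs: float       # time since the task started
    completed: bool = False   # assumption: a finished task is never speculated


def should_speculate(task: Task, average_progress: float) -> bool:
    """Return True when all three quoted conditions hold for this task."""
    return (
        not task.completed
        and task.running_secs >= SPECULATIVE_LAG_SECS
        and average_progress - task.progress > SPECULATIVE_GAP
    )


# Made-up scenario: 100 reducers, all running for two minutes.
# 70 have just left Copy and Sort (progress 0.66); 30 are still finishing Copy (0.33).
tasks = [Task(0.66, 120) for _ in range(70)] + [Task(0.33, 120) for _ in range(30)]
average = sum(t.progress for t in tasks) / len(tasks)
flagged = [t for t in tasks if should_speculate(t, average)]

print(f"average progress: {average:.3f}, flagged for speculation: {len(flagged)}")
```

In this toy scenario every reducer still in Copy is flagged, even though it is only moments behind the others in wall-clock terms, which is the behaviour the report describes. Making the gap or lag thresholds adjustable per job would be one way to give a job the "degree of control" the reporter asks for.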

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
