hadoop-common-dev mailing list archives

From "Bryan Pendleton (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-152) Speculative tasks not being scheduled
Date Fri, 12 May 2006 21:49:10 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-152?page=comments#action_12383306 ] 

Bryan Pendleton commented on HADOOP-152:
----------------------------------------

*bump*

Is anyone else seeing this problem? My cluster is pretty unevenly loaded and, without speculative
execution kicking in, I'm waiting a very long time for tasks to time out on short jobs. Speculative
execution is enabled, so there's no reason that, say, two maps out of ~1900 should be holding
up execution. I suspect the "progress" accounting done in the Job isn't correct.

But even then, perhaps we need more metrics. With the current ones, a job unit that happens to
be running really slowly on a given node, but might run faster on other nodes, might never get
executed on another node, because the progress reported on the slow node can be close enough
to done that it never trips speculative execution.

> Speculative tasks not being scheduled
> -------------------------------------
>
>          Key: HADOOP-152
>          URL: http://issues.apache.org/jira/browse/HADOOP-152
>      Project: Hadoop
>         Type: Bug

>   Components: mapred
>     Versions: 0.2
>  Environment: ~30 node Opteron cluster
>     Reporter: Bryan Pendleton
>     Priority: Minor

>
> The criteria for starting up a speculative task include a check that
> "average progress" - "progress" > the speculative gap, currently 0.2.
> I don't know if this is the right metric, but it doesn't seem to be correctly calculated.
> I've regularly seen the "average progress" with values of less than 0.01, while the "progress"
> value is showing in the range .90-1.0, and yet, still no speculative tasks are started up.
> This has caused at least one long-running task to run about 10% longer while overloaded hosts
> catch up.
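
To make the check in the description concrete, here is a minimal sketch of the gap test. It assumes the condition reads "average progress" - "progress" > 0.2. The class and method names are hypothetical, chosen for illustration only, and are not taken from the Hadoop source.

```java
// Hypothetical sketch of the speculative-gap check described above.
// Class and method names are illustrative, not Hadoop's actual API.
class SpeculativeGapCheck {
    // Gap value quoted in the issue description.
    static final double SPECULATIVE_GAP = 0.2;

    /**
     * Returns true when a task trails the job-wide average progress
     * by more than the speculative gap.
     */
    static boolean shouldSpeculate(double averageProgress, double taskProgress) {
        return averageProgress - taskProgress > SPECULATIVE_GAP;
    }

    public static void main(String[] args) {
        // Values as reported in the issue: average below 0.01, task at 0.90-1.0.
        System.out.println(shouldSpeculate(0.01, 0.95));
        // A slow task that already reports 85% done while the rest are finished.
        System.out.println(shouldSpeculate(1.0, 0.85));
        // A task that is clearly behind the average.
        System.out.println(shouldSpeculate(1.0, 0.5));
    }
}
```

Under this reading, the reported values (average below 0.01, task progress at 0.90 or more) give a negative difference, so the check cannot fire, which fits the suspicion that "average progress" is miscalculated. The second call illustrates the concern raised in the comment: a slow task reporting 0.85 sits only 0.15 behind a finished job average, inside the gap, so it would never be speculated.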

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

