hadoop-common-dev mailing list archives

From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2141) speculative execution start up condition based on completion time
Date Fri, 10 Apr 2009 03:51:12 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697724#action_12697724
] 

Devaraj Das commented on HADOOP-2141:
-------------------------------------

Andy, the current patch doesn't apply unless the fuzz factor is set to 3 - "patch -p0
-F 3 < HADOOP-2141-v6.patch". There is an NPE in the heartbeat method, which you can
reproduce by running the test TestMiniMRDFSSort - "ant -Dtestcase=TestMiniMRDFSSort test
-Dtest.output=yes"; the test never finishes because the TTs keep resending the heartbeat
forever. The NPE comes from the isSlowTracker method. Looking more closely at isSlowTracker,
I think it requires some rework. It currently looks at the progress rates of only the running
TIPs (you do check for TaskStatus.State.SUCCEEDED, but that is always false for RUNNING TIPs,
which is what is passed to the method) and attaches those rates to the TaskTrackers running
them. But wouldn't you want to look at the history, i.e., the successful TIPs that ran on
those TaskTrackers?

I am thinking it would make sense to give a TT one credit each time it runs a task
successfully and base isSlowTracker purely on that (rather than on the running tasks).
That way the TT's record can be maintained inline, and you wouldn't have to iterate over
the running TIPs and recompute it on every TT heartbeat. Thoughts?
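
To make the credit idea concrete, here is a rough sketch of what I have in mind (illustration
only, not code from the patch; the class, field, and method names are made up): keep a
per-TaskTracker count of successfully completed tasks and decide slowness from that alone, so
nothing has to be recomputed from the running TIPs on every heartbeat.

import java.util.HashMap;
import java.util.Map;

class TrackerSuccessCredits {
  // one credit per tracker for every task attempt that completes successfully
  private final Map<String, Integer> credits = new HashMap<String, Integer>();
  private int totalCredits = 0;

  // call once when a task attempt on this tracker succeeds
  synchronized void creditSuccess(String trackerName) {
    Integer c = credits.get(trackerName);
    credits.put(trackerName, c == null ? 1 : c + 1);
    totalCredits++;
  }

  // a tracker counts as "slow" if its credit count is well below the average
  // credit count of the trackers that have completed at least one task
  synchronized boolean isSlowTracker(String trackerName, double slowFraction) {
    if (credits.isEmpty()) {
      return false;                 // no history yet, nothing to compare against
    }
    Integer c = credits.get(trackerName);
    double average = (double) totalCredits / credits.size();
    return (c == null ? 0 : c) < slowFraction * average;
  }
}

With something like this the heartbeat path only does a map lookup, and the bookkeeping happens
at task completion time.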

> speculative execution start up condition based on completion time
> -----------------------------------------------------------------
>
>                 Key: HADOOP-2141
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2141
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.21.0
>            Reporter: Koji Noguchi
>            Assignee: Andy Konwinski
>         Attachments: 2141.patch, HADOOP-2141-v2.patch, HADOOP-2141-v3.patch, HADOOP-2141-v4.patch,
HADOOP-2141-v5.patch, HADOOP-2141-v6.patch, HADOOP-2141.patch
>
>
> We had one job with speculative execution hang.
> 4 reduce tasks were stuck with 95% completion because of a bad disk. 
> Devaraj pointed out 
> bq. One of the conditions that must be met for launching a speculative instance of a
task is that it must be at least 20% behind the average progress, and this is not true here.
> It would be nice if speculative execution also started up when tasks stop making progress.
> Devaraj suggested 
> bq. Maybe, we should introduce a condition for average completion time for tasks in the
speculative execution check. 
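
For illustration only (the names below are placeholders, not the actual JobInProgress code),
the existing 20%-behind-average check together with the suggested completion-time condition
could be sketched roughly as:

// illustrative sketch; SPECULATIVE_GAP and SPECULATIVE_LAG_FACTOR are made-up names
private static final double SPECULATIVE_GAP = 0.2;        // 20% behind average progress
private static final double SPECULATIVE_LAG_FACTOR = 2.0; // arbitrary factor for the sketch

boolean shouldSpeculate(double taskProgress, double averageProgress,
                        long taskRunTimeMs, long averageCompletionTimeMs) {
  // existing condition: the task is at least 20% behind the average progress
  boolean farBehind = taskProgress < averageProgress - SPECULATIVE_GAP;
  // suggested extra condition: the task has run far longer than the average
  // completion time, i.e. it has effectively stopped making progress
  boolean overdue = averageCompletionTimeMs > 0
      && taskRunTimeMs > SPECULATIVE_LAG_FACTOR * averageCompletionTimeMs;
  return farBehind || overdue;
}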

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

