hadoop-common-dev mailing list archives

From "Andy Konwinski (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-2141) speculative execution start up condition based on completion time
Date Fri, 20 Feb 2009 09:49:01 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Konwinski updated HADOOP-2141:
-----------------------------------

    Attachment: HADOOP-2141-v5.patch

First, I found a significant bug in the current patch, in the isSlowTracker() logic that
converts the sums over each TaskTracker's tasks into averages. The attached, updated patch
contains the bug fix.
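
To illustrate the kind of sum-to-average step involved, here is a minimal sketch. It is not
the code from the patch: the class TrackerAverages, its Stats accumulator, and the method
names are all hypothetical, and it assumes that each tracker's sum should be divided by that
tracker's own task count.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the isSlowTracker() code from the patch.
public class TrackerAverages {
    // Per-tracker accumulator: total task duration and number of tasks seen.
    static final class Stats {
        long sumMillis;
        int tasks;

        void add(long durationMillis) {
            sumMillis += durationMillis;
            tasks++;
        }

        // Divide by this tracker's own task count; guard against an empty tracker.
        double average() {
            return tasks == 0 ? 0.0 : (double) sumMillis / tasks;
        }
    }

    private final Map<String, Stats> byTracker = new HashMap<>();

    public void recordCompletedTask(String tracker, long durationMillis) {
        byTracker.computeIfAbsent(tracker, k -> new Stats()).add(durationMillis);
    }

    // Returns 0.0 for a tracker that has not reported any task yet.
    public double averageFor(String tracker) {
        Stats s = byTracker.get(tracker);
        return s == null ? 0.0 : s.average();
    }

    public static void main(String[] args) {
        TrackerAverages t = new TrackerAverages();
        t.recordCompletedTask("tracker-a", 1000);
        t.recordCompletedTask("tracker-a", 3000);
        t.recordCompletedTask("tracker-b", 9000);
        System.out.println("tracker-a average: " + t.averageFor("tracker-a"));
        System.out.println("tracker-b average: " + t.averageFor("tracker-b"));
    }
}
```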

Devaraj, regarding your suggestion to remove countSpeculating() in favor of class fields
that maintain counts of running speculative map and reduce tasks: I agree that this might
perform better, and it is easy to increment the counts in the correct spot (i.e. in
getSpeculativeMap() and getSpeculativeReduce()). It is less clear where to decrement them.
They need to be decremented when a speculative task is killed or completes, and the code
that manages this state transition seems convoluted because it handles a number of
scenarios (failed task trackers, a speculative attempt succeeding, a speculative attempt
being killed because the original attempt succeeded). I am getting a little lost digging
through the code trying to figure out where these variables would need to be decremented.
There is a comment in JobInProgress.completedTask() that says "TaskCommitThread in the
JobTracker marks other, completed, speculative tasks as _complete_.", but I can't find the
TaskCommitThread that it references, and I don't think that adjusting the counts only when
speculative tasks complete (as opposed to being killed or failing) would be enough. My vote
is that we put this off for now.
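
For reference if we pick this up later, one possible shape for such counters is sketched
below. This is not in the patch, and every name in it (SpeculativeCounts, launched(),
finished()) is hypothetical. It assumes attempts can be identified by a string ID, and it
tracks IDs in sets rather than bare integers so that reporting the same attempt as finished
from more than one code path (killed, failed, succeeded, tracker lost) cannot decrement
twice.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; none of these names exist in the patch.
public class SpeculativeCounts {
    private final Set<String> runningMaps = new HashSet<>();
    private final Set<String> runningReduces = new HashSet<>();

    // Would be called where a speculative attempt is handed out,
    // e.g. from getSpeculativeMap() / getSpeculativeReduce().
    public void launched(String attemptId, boolean isMap) {
        (isMap ? runningMaps : runningReduces).add(attemptId);
    }

    // Would be called from every terminal transition: success, kill,
    // failure, or a lost tracker. Removing an ID that is already gone
    // is a no-op, so repeated calls are harmless.
    public void finished(String attemptId) {
        runningMaps.remove(attemptId);
        runningReduces.remove(attemptId);
    }

    public int speculativeMaps() {
        return runningMaps.size();
    }

    public int speculativeReduces() {
        return runningReduces.size();
    }

    public static void main(String[] args) {
        SpeculativeCounts c = new SpeculativeCounts();
        c.launched("attempt_m_000001_1", true);
        c.launched("attempt_r_000002_1", false);
        c.finished("attempt_m_000001_1");
        c.finished("attempt_m_000001_1"); // duplicate report, ignored
        System.out.println(c.speculativeMaps() + " maps, "
                + c.speculativeReduces() + " reduces");
    }
}
```

The hard part described above, finding every place where an attempt reaches a terminal
state, is not solved by this sketch; it only makes a missed or duplicated call less
damaging.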

Regarding modifications to keep the sorted list of candidates around, one potential problem
I see is that a task cached in the sorted list of TIPs could finish before we recompute
the list, in which case there would be a possibility of speculating a task that has
already completed.
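
One way to guard against this, sketched below under assumptions, would be to re-check each
cached entry's completion state at the moment a candidate is taken from the list. The
CandidateCache and Task classes are hypothetical stand-ins, not the TaskInProgress class or
anything in the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of a cached candidate list that skips stale entries.
public class CandidateCache {
    // Stand-in for a task-in-progress; only the fields this sketch needs.
    public static final class Task {
        public final String id;
        public volatile boolean complete;

        public Task(String id) {
            this.id = id;
        }
    }

    private final List<Task> sorted = new ArrayList<>();

    // Replace the cached list with a freshly sorted one.
    public void replace(List<Task> freshlySorted) {
        sorted.clear();
        sorted.addAll(freshlySorted);
    }

    // Return the first cached candidate that is still incomplete right now,
    // so a task that finished after the last sort is never returned.
    public Optional<Task> nextCandidate() {
        for (Task t : sorted) {
            if (!t.complete) {
                return Optional.of(t);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Task slowest = new Task("tip_1");
        Task second = new Task("tip_2");
        CandidateCache cache = new CandidateCache();
        List<Task> order = new ArrayList<>();
        order.add(slowest);
        order.add(second);
        cache.replace(order);

        slowest.complete = true; // finishes before the next re-sort
        System.out.println(cache.nextCandidate().map(t -> t.id).orElse("none"));
    }
}
```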

I have implemented your suggestion to keep a list of task trackers around, and have set the
delay between re-sorts to 2 minutes (using the SLOW_TRACKER_SORT_DELAY constant).
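
As a rough sketch of this kind of time-based caching (not the patch's code), the class
below rebuilds its list only when the delay has elapsed. Only the constant name
SLOW_TRACKER_SORT_DELAY and the 2-minute value come from the patch description; the class
TrackerListCache, the injected clock and sorter, and the use of milliseconds are
assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch of a tracker list that is re-sorted at most once per delay.
public class TrackerListCache {
    // 2 minutes; expressing it in milliseconds is an assumption of this sketch.
    static final long SLOW_TRACKER_SORT_DELAY = 2 * 60 * 1000L;

    private final LongSupplier clock;             // current time in milliseconds
    private final Supplier<List<String>> sorter;  // produces a freshly sorted list
    private List<String> cached = new ArrayList<>();
    private long lastSort;
    private boolean sortedOnce = false;

    public TrackerListCache(LongSupplier clock, Supplier<List<String>> sorter) {
        this.clock = clock;
        this.sorter = sorter;
    }

    // Re-sort on first use or once the delay has elapsed; otherwise reuse the cache.
    public List<String> get() {
        long now = clock.getAsLong();
        if (!sortedOnce || now - lastSort >= SLOW_TRACKER_SORT_DELAY) {
            cached = new ArrayList<>(sorter.get());
            lastSort = now;
            sortedOnce = true;
        }
        return cached;
    }

    public static void main(String[] args) {
        TrackerListCache cache = new TrackerListCache(
                System::currentTimeMillis,
                () -> Arrays.asList("tracker-b", "tracker-a"));
        System.out.println(cache.get());
        System.out.println(cache.get()); // served from the cache, no re-sort
    }
}
```

Injecting the clock keeps the refresh logic testable without waiting two real minutes.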

I think it is important to test the effects of this patch on MapReduce performance, since
a lot of the code base has changed and this patch is quite different from the one we used
for the experiments in the OSDI paper.

Finally, Devaraj, I wanted to double-check that you didn't add any new functionality or
bug fixes in your patch, and that it only merged with trunk (and put the default values
for the parameters in mapred-default.xml instead of hadoop-default.xml). In particular, I
noticed some properties that your patch adds to mapred-default.xml that don't seem to be
related to this JIRA or used in the rest of the patch (e.g. mapred.shuffle.maxFetchPerHost).
Were these included intentionally?

> speculative execution start up condition based on completion time
> -----------------------------------------------------------------
>
>                 Key: HADOOP-2141
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2141
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Koji Noguchi
>            Assignee: Andy Konwinski
>         Attachments: 2141.patch, HADOOP-2141-v2.patch, HADOOP-2141-v3.patch, HADOOP-2141-v4.patch,
>                      HADOOP-2141-v5.patch, HADOOP-2141.patch
>
>
> We had one job with speculative execution hang.
> 4 reduce tasks were stuck with 95% completion because of a bad disk.
> Devaraj pointed out
> bq. One of the conditions that must be met for launching a speculative instance of a
> task is that it must be at least 20% behind the average progress, and this is not true here.
> It would be nice if speculative execution also starts up when tasks stop making progress.
> Devaraj suggested
> bq. Maybe, we should introduce a condition for average completion time for tasks in the
> speculative execution check.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

