hadoop-common-dev mailing list archives

From "Andy Konwinski (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2141) speculative execution start up condition based on completion time
Date Mon, 27 Apr 2009 09:49:30 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703048#action_12703048 ]

Andy Konwinski commented on HADOOP-2141:
----------------------------------------

Hi Devaraj, thanks for the code review. I have already implemented many of your comments and
am still working on the more significant ones; I should have a new patch ready in the next
2 to 3 days. Before then, though, I wanted to post an update on my progress and respond to
some of your suggestions to allow for wider input. Patch to come soon!

1. Done

2.  The problem with this is that we want to be able to identify laggard tasks even when they
are not reporting progress. I.e. if we don't get a TaskStatus update for that task from the
TT (perhaps because the TT is down, or the task is hanging) we want it to appear slower as
time goes on from the JT's perspective.

3. Currently it is a field because I am only recalculating the task tracker ordering every
SLOW_TRACKER_SORT_DELAY minutes (currently set to 2 min) so we have to keep the progress rate
scores around between those sorts. However, since I'm rewriting isSlowTracker() anyway (see
6 below) this is no longer relevant.

4. Done

5. Done

6. I've spoken with Matei about this and we've decided that using the mean and variance (i.e., is
the average ProgressRate of tasks that finished on the tracker more than one standard deviation
below the average ProgressRate of tasks on other trackers) to determine if a TaskTracker is slow
is much better than using a percentile. The current plan is to create a new class, DataStatistics,
used to track statistics for a set of numbers (by storing count, sum, and sum of squares).
DataStatistics will provide mean() and std() functions. The object will be used at two levels:
* a field of JobInProgress, taskStats, for tracking stats of all tasks
* a map field of JobInProgress, trackerStats, with key of type TaskTracker name, value of
type DataStatistics

Updating of the statistics data structures above will be a constant time operation done when
TaskTrackers report tasks as complete.

All of this makes isSlowTracker() really simple. Basically it consists of:
    if (trackerStats.get(taskTracker).mean() < taskStats.mean() - taskStats.std()) {
      return true;
    }
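
To make the plan in 6 concrete, here is a minimal sketch of what DataStatistics and the slow-tracker check might look like. It is not the code from the patch: the add() method, the taskCompleted() helper, the SlowTrackerSketch wrapper class, the use of a population standard deviation, and the choice to treat a tracker with no completed tasks as "not slow" are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Running statistics over a set of numbers, kept as count, sum, and sum of squares. */
class DataStatistics {
    private int count = 0;
    private double sum = 0.0;
    private double sumSquares = 0.0;

    /** Constant-time update (hypothetical method name). */
    void add(double value) {
        count++;
        sum += value;
        sumSquares += value * value;
    }

    double mean() {
        return count == 0 ? 0.0 : sum / count;
    }

    /** Population standard deviation: sqrt(E[x^2] - E[x]^2). */
    double std() {
        if (count == 0) {
            return 0.0;
        }
        double m = mean();
        double variance = sumSquares / count - m * m;
        // Guard against tiny negative values caused by floating-point rounding.
        return Math.sqrt(Math.max(variance, 0.0));
    }
}

/** Hypothetical stand-in for the two statistics fields proposed for JobInProgress. */
class SlowTrackerSketch {
    // Stats over the progress rates of all completed tasks.
    final DataStatistics taskStats = new DataStatistics();
    // Stats per TaskTracker, keyed by tracker name.
    final Map<String, DataStatistics> trackerStats = new HashMap<String, DataStatistics>();

    /** Called when a tracker reports a task as complete (hypothetical helper). */
    void taskCompleted(String trackerName, double progressRate) {
        taskStats.add(progressRate);
        DataStatistics stats = trackerStats.get(trackerName);
        if (stats == null) {
            stats = new DataStatistics();
            trackerStats.put(trackerName, stats);
        }
        stats.add(progressRate);
    }

    /** True if the tracker's mean rate is more than one std below the overall mean. */
    boolean isSlowTracker(String trackerName) {
        DataStatistics stats = trackerStats.get(trackerName);
        if (stats == null) {
            return false; // Assumption: no completed tasks means no evidence of slowness.
        }
        return stats.mean() < taskStats.mean() - taskStats.std();
    }
}
```

Storing only count, sum, and sum of squares keeps each update constant time, at the cost of some numerical precision when the values are large and close together.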

7. Let's use a percentage instead (10%?)

-------
One other comment: after discussing 2 above with Devaraj and Matei, we think it is important
to more closely consider the mechanism used to calculate a task's progress rate. The mechanism
we're using in the patch so far, i.e., progress / (currentTime - startTime), which can be
seen in TaskStatus.updateProgressRate, might be improved by looking more closely at
how to normalize the amount of time the task has been running by the amount of data it has
processed (potentially phase-wise). When Matei and I wrote the original LATE paper, we didn't
dig very deep into the task progress reporting mechanisms, but rather just used the progress
as it was reported, while making note of some of the oddities re. the three phases. I am still
trying to validate for myself how closely the progress as reported by tasks to the TaskTracker
reflects the amount of data processed thus far. However, pending a deeper look into this,
it might be advantageous to revisit the progressRate mechanism after we commit a simple version
of the patch which uses progressRate as is (assuming that testing at scale shows performance
improvements). 
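
For reference, a minimal sketch of the simple progress rate described above, progress / (currentTime - startTime). The class name, method signature, millisecond units, and the zero return for a non-positive elapsed time are assumptions for illustration and are not taken from TaskStatus.updateProgressRate.

```java
/** Hypothetical helper illustrating the simple progress-rate calculation. */
class ProgressRateSketch {
    /**
     * @param progress    last reported progress, assumed to be in [0, 1]
     * @param startTime   task start time in milliseconds
     * @param currentTime the caller's current time in milliseconds
     * @return progress per millisecond, or 0 if no time has elapsed
     */
    static double progressRate(double progress, long startTime, long currentTime) {
        long elapsed = currentTime - startTime;
        if (elapsed <= 0) {
            return 0.0; // Assumption: avoid division by zero for a task that just started.
        }
        return progress / elapsed;
    }
}
```

Because the rate is evaluated against the caller's clock rather than the time of the last status update, a task whose reported progress stays fixed gets a lower rate each time it is recomputed, which is the behavior wanted in 2 above.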

Again, the patch will be up in the next few days.
Andy

> speculative execution start up condition based on completion time
> -----------------------------------------------------------------
>
>                 Key: HADOOP-2141
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2141
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.21.0
>            Reporter: Koji Noguchi
>            Assignee: Andy Konwinski
>         Attachments: 2141.patch, HADOOP-2141-v2.patch, HADOOP-2141-v3.patch, HADOOP-2141-v4.patch,
>                      HADOOP-2141-v5.patch, HADOOP-2141-v6.patch, HADOOP-2141.patch
>
>
> We had one job with speculative execution hang.
> 4 reduce tasks were stuck with 95% completion because of a bad disk. 
> Devaraj pointed out 
> bq. One of the conditions that must be met for launching a speculative instance of a
> task is that it must be at least 20% behind the average progress, and this is not true here.
> It would be nice if speculative execution also starts up when tasks stop making progress.
> Devaraj suggested 
> bq. Maybe, we should introduce a condition for average completion time for tasks in the
> speculative execution check.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

