hadoop-common-dev mailing list archives

From "Andy Konwinski (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2141) speculative execution start up condition based on completion time
Date Thu, 07 May 2009 10:24:31 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706790#action_12706790 ]

Andy Konwinski commented on HADOOP-2141:
----------------------------------------

The current patch contains the changes discussed (see my responses below).

2.  We are now using the task dispatch time from the JT as the base time for estimating progress, so that the time estimates are accurate and also account for potential laggard behavior of a node due to network problems/latency.
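
For illustration, here is a minimal sketch of this kind of estimate (not the patch's actual code; names like dispatchTimeMs are placeholders). Basing elapsed time on the JT's dispatch timestamp means dispatch/heartbeat latency counts against the task's apparent rate:

{code:java}
/** Illustrative sketch only, not the patch's code. */
public class CompletionEstimate {
  /**
   * @param dispatchTimeMs when the JT dispatched the task
   * @param progress       reported progress in (0.0, 1.0]
   * @return estimated wall-clock completion time in ms
   */
  public static long estimateCompletionMs(long dispatchTimeMs, double progress) {
    long now = System.currentTimeMillis();
    long elapsed = now - dispatchTimeMs;  // includes dispatch/network latency
    if (progress <= 0.0) {
      return Long.MAX_VALUE;              // no progress signal yet
    }
    // Assume a roughly constant progress rate over the task's lifetime.
    long timeLeft = (long) (elapsed * (1.0 - progress) / progress);
    return now + timeLeft;
  }
}
{code}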

5. I've put locality preference back in for speculative maps.

6. I implemented isSlowTracker as I described above; the number of standard deviations that a TT has to be below the global average is specified in the conf file (with a default of 1 std).
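
For concreteness, a sketch of that check (illustrative only, not the patch's code; this version measures slowness on task durations, so a slow tracker's mean sits above the job-wide mean):

{code:java}
import java.util.Collection;

/** Illustrative sketch of the isSlowTracker idea; not the patch's code. */
public class SlowTrackerCheck {
  /**
   * @param trackerDurations  durations (ms) of this TT's completed tasks for the job
   * @param jobMean           mean duration of all completed tasks in the job
   * @param jobStdDev         standard deviation of those durations
   * @param slowNodeThreshold configurable threshold, default 1.0
   */
  public static boolean isSlowTracker(Collection<Long> trackerDurations,
                                      double jobMean, double jobStdDev,
                                      double slowNodeThreshold) {
    if (trackerDurations.isEmpty()) {
      return true;  // no history yet: conservatively treat as possibly slow
    }
    double sum = 0;
    for (long d : trackerDurations) {
      sum += d;
    }
    double trackerMean = sum / trackerDurations.size();
    return trackerMean > jobMean + slowNodeThreshold * jobStdDev;
  }
}
{code}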

7. I just removed this filter; we now allow speculation if there is more than one task.

* Also, I changed the behavior of the filter that only allows tasks that have run for more than a minute to be speculated. For now I've set the threshold to 0, which means tasks aren't being filtered, but this way we can keep an eye on it while testing and easily turn the filter back on if we want it. I think it is just a remnant of the original speculative execution heuristic.
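
If we do want the filter back, the threshold could be read from the conf along these lines (sketch only; the property name below is made up):

{code:java}
import org.apache.hadoop.conf.Configuration;

/** Sketch of the minimum-runtime filter with a configurable threshold;
 *  the property name below is hypothetical. */
public class MinRuntimeFilter {
  public static boolean oldEnoughToSpeculate(Configuration conf,
                                             long dispatchTimeMs, long nowMs) {
    // A default of 0 disables the filter, matching the current behavior.
    long minRunMs = conf.getLong("mapred.speculative.min.run.time", 0L);
    return (nowMs - dispatchTimeMs) >= minRunMs;
  }
}
{code}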

I have been testing this patch on small sort jobs on a 10-node EC2 cluster for a couple of days now. I've been simulating laggards by running nice -n -20 ruby -e "while true;;end" loops as well as dd if=/dev/zero of=/tmp/tmpfile bs=100000. Hopefully large-scale testing will flush out any bugs I've missed.

Other thoughts and some ideas for near term future work:

* As we've already discussed, after this patch gets tested and committed we should update the way we calculate task progress, probably normalizing by the size of a task's data input. Also, we might think about using only the first two phases of reduce tasks to estimate the performance of TaskTrackers, because we know more about the behavior of those phases.
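
As a sketch of the normalization idea (future work, nothing in this patch; names are illustrative), comparing bytes per millisecond rather than raw progress keeps tasks that were handed larger input splits from looking like laggards:

{code:java}
/** Sketch of input-size-normalized progress; a future-work idea only. */
public class NormalizedRate {
  /**
   * @param progress       reported progress in [0.0, 1.0]
   * @param inputBytes     size of the task's input split
   * @param dispatchTimeMs when the JT dispatched the task
   * @param nowMs          current time in ms
   */
  public static double bytesPerMs(double progress, long inputBytes,
                                  long dispatchTimeMs, long nowMs) {
    long elapsed = Math.max(1, nowMs - dispatchTimeMs);  // avoid divide-by-zero
    // progress * inputBytes approximates bytes processed so far.
    return (progress * inputBytes) / elapsed;
  }
}
{code}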

* We should further improve isSlowTracker() with regard to how we handle TaskTrackers that have not reported any successful tasks for this job. Right now, if a TT 1) is really slow, or 2) was added to the cluster near the end of a job, or 3) is part of a job that is smaller than the cluster and thus spreads its tasks out thinly, then the tracker might not have reported a successful task by the time we start looking to run speculative tasks. In this case we don't know whether the tracker is a laggard, since we use a TT's history to determine if it is slow. Currently we just assume it might be a laggard, and thus isSlowTracker() will return true. In the near future it will be better to allow assignment of a speculative task to a TT if:
1) the TT has run at least one successful task for this job already and its average task duration is less than slowNodeThreshold standard deviations below the average task duration of all completed tasks for this job, or
2) the TT has not completed any tasks for this job yet (i.e. it may have been assigned a task for this job, but the task has not completed yet).
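
Put together as a sketch (illustrative, not in the patch; condition 1's comparison here assumes slowness shows up as longer-than-average durations):

{code:java}
/** Sketch of the proposed assignment check described above; not in the patch. */
public class SpecAssignCheck {
  public static boolean mayAssignSpeculative(int completedOnTracker,
                                             double trackerMeanDuration,
                                             double jobMeanDuration,
                                             double jobStdDev,
                                             double slowNodeThreshold) {
    if (completedOnTracker == 0) {
      // Condition 2: no completed tasks for this job yet, so there is
      // no evidence that the tracker is slow.
      return true;
    }
    // Condition 1: its completed-task history is within the threshold.
    return trackerMeanDuration <=
        jobMeanDuration + slowNodeThreshold * jobStdDev;
  }
}
{code}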

* Finally, we might want to think up some unit test cases for speculative execution.

> speculative execution start up condition based on completion time
> -----------------------------------------------------------------
>
>                 Key: HADOOP-2141
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2141
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.21.0
>            Reporter: Koji Noguchi
>            Assignee: Andy Konwinski
>         Attachments: 2141.patch, HADOOP-2141-v2.patch, HADOOP-2141-v3.patch, HADOOP-2141-v4.patch,
>                      HADOOP-2141-v5.patch, HADOOP-2141-v6.patch, HADOOP-2141.patch, HADOOP-2141.v7.patch
>
>
> We had one job with speculative execution hang.
> 4 reduce tasks were stuck at 95% completion because of a bad disk.
> Devaraj pointed out 
> bq. One of the conditions that must be met for launching a speculative instance of a task is that it must be at least 20% behind the average progress, and this is not true here.
> It would be nice if speculative execution also starts up when tasks stop making progress.
> Devaraj suggested 
> bq. Maybe, we should introduce a condition for average completion time for tasks in the speculative execution check.
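
For context on the heuristic quoted above, a sketch of the 20%-gap condition (the 0.2 constant comes from the description, not necessarily the exact source): with the job average near 100%, a task stuck at 95% trails by only about 5% and never qualifies.

{code:java}
/** Sketch of the pre-existing condition described above; illustrative only. */
public class OldSpecCondition {
  public static boolean speculatable(double taskProgress, double avgProgress) {
    return taskProgress < avgProgress - 0.2;
  }
}
{code}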

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

