hadoop-mapreduce-issues mailing list archives

From "Arpit Gupta (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-4749) Killing multiple attempts of a task takes longer as more attempts are killed
Date Sat, 03 Nov 2012 00:48:12 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arpit Gupta updated MAPREDUCE-4749:

    Attachment: MAPREDUCE-4749.branch-1.patch

bq. Can there be multiple kill events for the same task or job? If so allCleanupActions could
be empty when there are still pending events. I don't think this can happen, but I want to
be sure about it.

allCleanUpActions is populated whenever the user adds an item to the queue, and an entry is
only removed when the task/job is killed. So even when only tainted tasks are left, it won't be empty.

And yes, before this patch we could have multiple kill events for the same task/job. Now we
make sure that is not the case.
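
To illustrate the idea described above, here is a minimal sketch of a cleanup queue that
tracks every pending kill request in a set, so a second kill event for the same task/job is
ignored until the first one has completed. This is not the code from the patch: the class
DedupCleanupQueue and its methods are hypothetical names, and plain strings stand in for the
real action objects.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch; not the TaskTracker code from the patch. */
class DedupCleanupQueue {
    // Every id with a kill still pending, including ids already taken
    // off the queue but not yet killed (plays the role of allCleanUpActions).
    private final Set<String> allActions = ConcurrentHashMap.newKeySet();
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    /** Enqueues a kill request; returns false if one is already pending for this id. */
    boolean add(String id) {
        if (!allActions.add(id)) {
            return false; // duplicate kill event, ignore it
        }
        queue.add(id);
        return true;
    }

    /** Returns the next request to process, or null if none is queued. The id stays tracked. */
    String poll() {
        return queue.poll();
    }

    /** Called once the task/job has actually been killed; only now is the id forgotten. */
    void markKilled(String id) {
        allActions.remove(id);
    }

    /** True while any kill is queued or in progress. */
    boolean hasPending() {
        return !allActions.isEmpty();
    }
}
```

Because an id is removed only in markKilled, hasPending() stays true while a request is being
processed, which matches the point that the set is not empty while events are outstanding.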

bq. I don't think isJobLocalising throws an InterruptedException, and the javadocs for that
method are wrong.

Updated the javadoc and removed the throws clause.

bq. My other comment would be about the wait and notify. In this patch you have changed the
wait to be on the taskCleanupThread itself instead of rjob. It appears that no one will ever
notify the taskCleanupThread. So please either change the wait to a sleep, or add in a call
to taskCleanupThread.notifyAll() at about the same place that rjob.notifyAll() is happening.
As part of that too you will need to synchronize with the taskCleanupThread before calling
notifyAll. You will probably also want to synchronize around the wait, but be careful so you
get the locking order consistent between rjob and taskCleanupThread, or leave the two notify/lock
pairs separate which might be simpler.

I decided to change the wait to a Thread.sleep in the task cleanup and removed the rjob.notifyAll.
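
For reference, the two options discussed above can be sketched as follows. This is an
illustration only, not the patch: the class CleanupWaitSketch, its fields, and its methods
are hypothetical. Option A shows wait/notifyAll with both sides synchronized on the same
monitor; Option B shows sleep-based polling, which needs no notifier at all.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the two waiting strategies; not the patch itself. */
class CleanupWaitSketch {
    private final Object lock = new Object();
    private boolean localizing = true;

    // Option A: wait/notify. The waiter and the notifier must hold the
    // same monitor, and someone must actually call notifyAll().
    boolean awaitLocalized() {
        synchronized (lock) {
            while (localizing) {            // loop guards against spurious wakeups
                try {
                    lock.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;           // interrupted before localization finished
                }
            }
            return true;
        }
    }

    void finishLocalizing() {
        synchronized (lock) {
            localizing = false;
            lock.notifyAll();               // wakes every thread waiting on lock
        }
    }

    // Option B: sleep-based polling. Simpler, at the cost of up to
    // pollMillis of extra latency per check.
    static boolean pollUntil(BooleanSupplier done, long pollMillis, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!done.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;               // timed out
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

With Option A, a wait() on an object that nobody ever notifies blocks indefinitely, which is
the concern raised in the review comment. Option B avoids that dependency and also avoids
having to keep a consistent lock order between two monitors.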

Also, I had to restructure the code a bit so that I could write unit tests to cover various
> Killing multiple attempts of a task takes longer as more attempts are killed
> ----------------------------------------------------------------------------
>                 Key: MAPREDUCE-4749
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4749
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Arpit Gupta
>            Assignee: Arpit Gupta
>         Attachments: MAPREDUCE-4749.branch-1.patch, MAPREDUCE-4749.branch-1.patch, MAPREDUCE-4749.branch-1.patch
> The following was noticed on an MR job running on Hadoop 1.1.0:
> 1. Start an MR job with 1 mapper.
> 2. Wait for a minute.
> 3. Kill the first attempt of the mapper, then kill the other 3 attempts in order to fail the job.
> The time taken to kill the task grew exponentially.
> 1st attempt was killed immediately.
> 2nd attempt took a little over a min
> 3rd attempt took approx. 20 mins
> 4th attempt took around 3 hrs.
> The command used to kill the attempt was "hadoop job -fail-task"
> Note that the command returned immediately, as soon as the fail request was accepted, but
> the attempt was not actually killed until the times stated above.

