hadoop-mapreduce-issues mailing list archives

From "Robert Joseph Evans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4749) Killing multiple attempts of a task takes longer as more attempts are killed
Date Fri, 02 Nov 2012 16:47:12 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489534#comment-13489534 ]

Robert Joseph Evans commented on MAPREDUCE-4749:

It has been a while since I worked in this code, so feel free to correct me if I am wrong
about something. The changes look good to me, but the TT (TaskTracker) is huge and I have not
looked at it in that much depth. Can there be multiple kill events for the same task or job?
If so, allCleanupActions could be empty while there are still pending events. I don't think
this can happen, but I want to be sure about it.

I don't think isJobLocalising throws an InterruptedException, and the javadocs for that method
are wrong.

My other comment is about the wait and notify. In this patch you have changed the wait to be
on the taskCleanupThread itself instead of rjob. It appears that no one will ever notify the
taskCleanupThread. So please either change the wait to a sleep, or add a call to
taskCleanupThread.notifyAll() at about the same place that rjob.notifyAll() is happening. As
part of that you will also need to synchronize on the taskCleanupThread before calling
notifyAll. You will probably also want to synchronize around the wait, but be careful to keep
the locking order consistent between rjob and taskCleanupThread, or leave the two notify/lock
pairs separate, which might be simpler.
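
To make the wait/notify point concrete, here is a minimal sketch of the pattern, not the
TaskTracker code itself. The class CleanupWaitSketch, the cleanupLock object (standing in for
the taskCleanupThread monitor), the pendingKills queue, and both method names are hypothetical
and exist only for illustration. It assumes the simpler option above: the cleanup monitor is
kept separate from rjob, so no lock ordering between the two is involved.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical illustration only; none of these names come from the patch.
final class CleanupWaitSketch {
    // Stand-in for the monitor that the cleanup thread waits on.
    private final Object cleanupLock = new Object();
    private final Queue<String> pendingKills = new ArrayDeque<>();

    /** Notifier side: take the monitor first, then call notifyAll(). */
    void addKillAction(String attemptId) {
        synchronized (cleanupLock) {
            pendingKills.add(attemptId);
            // Calling notifyAll() without holding cleanupLock would throw
            // IllegalMonitorStateException.
            cleanupLock.notifyAll();
        }
    }

    /**
     * Waiter side: wait inside a synchronized block, in a loop that rechecks
     * the condition. Returns null if nothing arrives before the timeout or if
     * the thread is interrupted.
     */
    String takeKillAction(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        synchronized (cleanupLock) {
            while (pendingKills.isEmpty()) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return null; // timed out, nothing to clean up
                }
                try {
                    cleanupLock.wait(remaining);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return null;
                }
            }
            return pendingKills.remove();
        }
    }
}
```

Without the notifyAll() call in addKillAction, the waiter would only wake when its timeout
expires, which is the behaviour I am worried about in the patch.
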
> Killing multiple attempts of a task takes longer as more attempts are killed
> ----------------------------------------------------------------------------
>                 Key: MAPREDUCE-4749
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4749
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Arpit Gupta
>            Assignee: Arpit Gupta
>         Attachments: MAPREDUCE-4749.branch-1.patch, MAPREDUCE-4749.branch-1.patch
> The following was noticed on an MR job running on Hadoop 1.1.0
> 1. Start an MR job with 1 mapper
> 2. Wait for a minute
> 3. Kill the first attempt of the mapper and then subsequently kill the other 3 attempts
> in order to fail the job
> The time taken to kill the task grew exponentially.
> 1st attempt was killed immediately.
> 2nd attempt took a little over a min
> 3rd attempt took approx. 20 mins
> 4th attempt took around 3 hrs.
> The command used to kill the attempt was "hadoop job -fail-task"
> Note that the command returned as soon as the fail request was accepted, but the attempt
> was not actually killed until the times stated above.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira