hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4751) AM stuck in KILL_WAIT for days
Date Fri, 09 Nov 2012 20:43:12 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494295#comment-13494295 ]

Jason Lowe commented on MAPREDUCE-4751:

Part of the issue is that the job hangs around waiting for all tasks to be killed rather
than just exiting and letting YARN shoot any straggling containers.  I think it would be simpler
and safer for the AM to just write out the final state information and exit, much like it does
for the FAILED state.  If the job's KILL_WAIT state really is necessary, then we'd need a
corresponding FAILED_WAIT state to handle waiting for task cleanup when a job fails.
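
To make the difference concrete, here is a minimal sketch of the two behaviors at the job level. It is a toy model only: JobKillSketch, killAndWait, killAndExit and the log strings are hypothetical names made up for illustration, not the actual MapReduce AM classes or transitions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified job model; NOT the real AM job implementation.
class JobKillSketch {
    enum JobState { RUNNING, KILL_WAIT, KILLED, FAILED }

    JobState state = JobState.RUNNING;
    final List<String> log = new ArrayList<>();

    // Current behavior as described: enter KILL_WAIT and only finish once
    // every task has reported that it was killed.
    void killAndWait(int tasksStillRunning) {
        state = JobState.KILL_WAIT;
        log.add("kill sent to tasks");
        if (tasksStillRunning == 0) {
            finish(JobState.KILLED);
        }
        // Otherwise the job sits in KILL_WAIT until a task event arrives,
        // which may never happen if the task's node has been lost.
    }

    // Suggested behavior: handle a kill like a failure and finish right away,
    // leaving straggling containers for YARN to clean up.
    void killAndExit() {
        log.add("kill sent to tasks");
        finish(JobState.KILLED);
    }

    // Failure path: goes straight to the terminal state without waiting.
    void fail() {
        finish(JobState.FAILED);
    }

    private void finish(JobState terminal) {
        state = terminal;
        log.add("final state written: " + terminal);
    }
}
```

In this model, calling killAndWait(2) leaves the job in KILL_WAIT indefinitely, while killAndExit() reaches KILLED in one step, the same way fail() reaches FAILED.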

If we don't need the job's KILL_WAIT state then similarly we can probably ditch the task KILL_WAIT
state -- the task could just send off kills to all of its task attempts and sit in the
KILLED state.  Does it really need to wait?
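
A rough sketch of what a task without KILL_WAIT might look like follows. TaskKillSketch and Attempt are hypothetical stand-ins invented for this example, and the real task and task attempt classes carry far more state than is modeled here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified task model; NOT the real AM task implementation.
class TaskKillSketch {
    enum TaskState { RUNNING, KILLED }

    // Stand-in for a task attempt that only records whether a kill was requested.
    static class Attempt {
        boolean killRequested = false;

        void requestKill() {
            killRequested = true;
        }
    }

    TaskState state = TaskState.RUNNING;
    final List<Attempt> attempts = new ArrayList<>();

    // Fire-and-forget kill: notify every attempt, then move straight to
    // KILLED without waiting for any attempt to acknowledge.
    void kill() {
        for (Attempt attempt : attempts) {
            attempt.requestKill();
        }
        state = TaskState.KILLED;
    }
}
```

The trade-off in this sketch is that the task reports KILLED while attempts may still be running, so it relies on something else (YARN, in the suggestion above) to reap the containers.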

Removing KILL_WAIT is a considerably bigger change than the current one, as it would break a
lot of tests that know and expect the KILL_WAIT state.  However, I think it would be more robust
in the long term, as KILL_WAIT seems like a state primed for hanging if we don't really need
it.  Since we're eager to get a fix for this in soon, we could address that in a followup JIRA.
> AM stuck in KILL_WAIT for days
> ------------------------------
>                 Key: MAPREDUCE-4751
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4751
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.23.3, 2.0.2-alpha
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: MAPREDUCE-4751-20121108.txt, TaskAttemptStateGraph.jpg
> We found some jobs were stuck in KILL_WAIT for days on end. The RM shows them as RUNNING.
When you go to the AM, it shows it in the KILL_WAIT state, and a few maps running. All these
maps were scheduled on nodes which are now in the RM's Lost nodes list. The running maps are

