hadoop-yarn-issues mailing list archives

From "Robert Joseph Evans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-167) AM stuck in KILL_WAIT for days
Date Wed, 24 Oct 2012 17:24:16 GMT

    [ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483413#comment-13483413 ]

Robert Joseph Evans commented on YARN-167:
------------------------------------------

Looking at the UI for one of the jobs that is stuck in this state, and at a heap dump for that
AM, I can see that the job is in KILL_WAIT and so are many of its tasks.  But for all of the
tasks in KILL_WAIT that I looked at, the task attempts are all in FAILED, and none of them
failed because of a node that disappeared.  It looks very much like TaskImpl just needs to
be able to handle T_ATTEMPT_FAILED and T_ATTEMPT_SUCCEEDED in the KILL_WAIT state, instead
of ignoring them.  I will look to see if this also exists in 2.0.  I think all we need to
do to reproduce this is to launch a large job that will have most of its tasks fail, and then
try to kill it before the job fails on its own.
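
To illustrate the suspected problem, here is a minimal, self-contained sketch. It is not the
actual TaskImpl code: the class KillWaitSketch, the liveAttempts counter, the
handleAllTerminalEvents flag, and the T_KILL and T_ATTEMPT_KILLED event names are assumptions
made for the example.  It only shows why a task waiting in KILL_WAIT never leaves that state
if it ignores attempts that end as failed or succeeded.

```java
public class KillWaitSketch {
    // Simplified, hypothetical states and events; only KILL_WAIT, T_ATTEMPT_FAILED and
    // T_ATTEMPT_SUCCEEDED are named in the comment above, the rest are assumed.
    enum TaskState { RUNNING, KILL_WAIT, KILLED }
    enum TaskEvent { T_KILL, T_ATTEMPT_KILLED, T_ATTEMPT_FAILED, T_ATTEMPT_SUCCEEDED }

    static class Task {
        TaskState state = TaskState.RUNNING;
        int liveAttempts;                      // attempts that have not finished yet
        final boolean handleAllTerminalEvents; // false = suspected buggy behaviour

        Task(int liveAttempts, boolean handleAllTerminalEvents) {
            this.liveAttempts = liveAttempts;
            this.handleAllTerminalEvents = handleAllTerminalEvents;
        }

        void handle(TaskEvent event) {
            switch (state) {
                case RUNNING:
                    // Attempt completion while RUNNING is not modelled in this sketch.
                    if (event == TaskEvent.T_KILL) {
                        state = TaskState.KILL_WAIT;
                    }
                    break;
                case KILL_WAIT:
                    // An attempt that was killed always counts as finished.  One that
                    // failed or succeeded counts only when the proposed handling is on.
                    boolean finished = event == TaskEvent.T_ATTEMPT_KILLED
                        || (handleAllTerminalEvents
                            && (event == TaskEvent.T_ATTEMPT_FAILED
                                || event == TaskEvent.T_ATTEMPT_SUCCEEDED));
                    if (finished && --liveAttempts == 0) {
                        state = TaskState.KILLED;
                    }
                    break;
                default:
                    break; // KILLED is terminal
            }
        }
    }

    public static void main(String[] args) {
        Task ignoring = new Task(1, false);
        ignoring.handle(TaskEvent.T_KILL);
        ignoring.handle(TaskEvent.T_ATTEMPT_FAILED);
        System.out.println(ignoring.state); // prints KILL_WAIT

        Task handling = new Task(1, true);
        handling.handle(TaskEvent.T_KILL);
        handling.handle(TaskEvent.T_ATTEMPT_FAILED);
        System.out.println(handling.state); // prints KILLED
    }
}
```

In the first case the only attempt fails on its own after the kill was requested, the event is
dropped, and nothing else ever arrives to move the task out of KILL_WAIT.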

This particular job had 2645 map tasks: 634 of them got stuck in KILL_WAIT, 1347 of them were
successfully killed, and 623 of them finished with SUCCESS.  This was running on a 2,000-node
cluster.  The failed tasks appeared to take about 20 seconds before they failed, but
the last attempts to fail all ended within a second of each other.
                
> AM stuck in KILL_WAIT for days
> ------------------------------
>
>                 Key: YARN-167
>                 URL: https://issues.apache.org/jira/browse/YARN-167
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 0.23.3
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: TaskAttemptStateGraph.jpg
>
>
> We found some jobs were stuck in KILL_WAIT for days on end. The RM shows them as RUNNING.
> When you go to the AM, it shows it in the KILL_WAIT state, and a few maps running. All these
> maps were scheduled on nodes which are now in the RM's Lost nodes list. The running maps are
> in the FAIL_CONTAINER_CLEANUP state.

