Return-Path:
X-Original-To: apmail-hadoop-mapreduce-issues-archive@minotaur.apache.org
Delivered-To: apmail-hadoop-mapreduce-issues-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 202DFDCD3 for ; Sat, 10 Nov 2012 02:07:13 +0000 (UTC)
Received: (qmail 49396 invoked by uid 500); 10 Nov 2012 02:07:12 -0000
Delivered-To: apmail-hadoop-mapreduce-issues-archive@hadoop.apache.org
Received: (qmail 49362 invoked by uid 500); 10 Nov 2012 02:07:12 -0000
Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: mapreduce-issues@hadoop.apache.org
Delivered-To: mailing list mapreduce-issues@hadoop.apache.org
Received: (qmail 49352 invoked by uid 99); 10 Nov 2012 02:07:12 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 10 Nov 2012 02:07:12 +0000
Date: Sat, 10 Nov 2012 02:07:12 +0000 (UTC)
From: "Vinod Kumar Vavilapalli (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Message-ID: <316611548.95255.1352513232865.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (MAPREDUCE-4751) AM stuck in KILL_WAIT for days
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494518#comment-13494518 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-4751:
----------------------------------------------------

bq. Part of the issue is that the job is hanging around waiting for all tasks to be killed rather than just exiting and letting YARN shoot any straggling containers. I think it would be simpler/safer for the AM to just write out the final state stuff and exit, much like it does for the FAILED state.
If job's KILL_WAIT really is necessary then we'd need a corresponding FAILED_WAIT state to handle waiting for task cleanup when a job fails.

I agree. Sharad and I debated this for a while when we wrote this initially. We left it the way it is now just to be sure that AMs exit sanely, but we can change it. The only catch I can think of is that, while the AM tries to do the remaining cleanup work (jobhistory etc.), tasks will keep bombarding the AM with more updates.

I didn't realize that we don't have a FAIL_WAIT state. The change isn't much bigger, but it can break tests. Let's pursue that separately.

The current bug is caused by Tasks waiting on TAs, which should be fixed by my patch. Of course, fixing it then exposes the job bug; let's fix that separately.

Regarding doing away with Task's KILL_WAIT, I disagree. Tasks can get a kill signal while the AM is running, so we should handle it explicitly by killing and waiting for all attempts; otherwise we run the risk of dangling JVMs doing nothing but occupying slots until the AM exits.

> AM stuck in KILL_WAIT for days
> ------------------------------
>
>                 Key: MAPREDUCE-4751
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4751
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>   Affects Versions: 0.23.3, 2.0.2-alpha
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: MAPREDUCE-4751-20121108.txt, MAPREDUCE-4751-20121109.txt, TaskAttemptStateGraph.jpg
>
>
> We found some jobs were stuck in KILL_WAIT for days on end. The RM shows them as RUNNING. When you go to the AM, it shows it in the KILL_WAIT state, and a few maps running. All these maps were scheduled on nodes which are now in the RM's Lost nodes list. The running maps are in the FAIL_CONTAINER_CLEANUP state

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira