Date: Fri, 21 Oct 2011 19:40:32 +0000 (UTC)
From: "Robert Joseph Evans (Commented) (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Message-ID: <1841211396.2720.1319226032076.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1503650769.13273.1318582812767.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-3186) User jobs are getting hanged if the Resource manager process goes down and comes up while job is getting executed.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13132982#comment-13132982 ]

Robert Joseph Evans commented on MAPREDUCE-3186:
------------------------------------------------

Yes, an RM restart is not that common, but when it does happen I don't want to have to ssh to every one of our 4000 nodes in a cluster and try to kill off all of the running AppMasters. Even with scripts, that can be painful. What about other containers? Are they also not killed off when the NM exits?

> User jobs are getting hanged if the Resource manager process goes down and comes up while job is getting executed.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3186
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3186
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>         Environment: linux
>            Reporter: Ramgopal N
>            Assignee: Eric Payne
>              Labels: test
>
> If the resource manager is restarted while a job is executing, the job hangs.
> The UI shows the job as running.
> The RM log shows the error "ERROR org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AppAttemptId doesnt exist in cache appattempt_1318579738195_0004_000001"
> On the console, the MRAppMaster and RunJar processes are not killed.