Date: Thu, 20 Aug 2015 02:02:46 +0000 (UTC)
From: "Robert Kanter (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Updated] (HADOOP-12317) Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST

     [ https://issues.apache.org/jira/browse/HADOOP-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kanter updated HADOOP-12317:
-----------------------------------
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 2.8.0
           Status: Resolved  (was: Patch Available)

Thanks for the fix Anubhav. Committed to trunk and branch-2!

> Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-12317
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12317
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Anubhav Dhoot
>            Assignee: Anubhav Dhoot
>            Priority: Critical
>             Fix For: 2.8.0
>
>         Attachments: YARN-4046.002.patch, YARN-4046.002.patch, YARN-4096.001.patch
>
>
> On a Debian machine we have seen NodeManager recovery of containers fail because the signal syntax for a process group may not work. We see errors when checking whether the process is alive during container recovery, which causes the container to be declared LOST (154) on a NodeManager restart.
> The application will fail with an error. The attempts are not retried.
> {noformat}
> Application application_1439244348718_0001 failed 1 times due to Attempt recovered after RM restartAM Container for appattempt_1439244348718_0001_000001 exited with exitCode: 154
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
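
For readers unfamiliar with the failure mode described in the issue, here is a minimal Python sketch of a process-group liveness check. It is not the committed patch and not NodeManager code: the function names are hypothetical, and it assumes the failing check is a `kill -0 -<pgid>` style command, which this message does not show. Placing `--` before the negative ID is one commonly used way to stop `kill` from parsing `-<pgid>` as an option. Whether that matches the actual fix is an assumption.

```python
import os
from typing import List


def check_group_alive_command(pgid: int) -> List[str]:
    """Build a shell-free argv that probes a process group with signal 0.

    Hypothetical helper: "--" ends option parsing, so "-<pgid>" is read as a
    process-group target rather than as an option or signal name.
    """
    return ["kill", "-0", "--", "-{}".format(pgid)]


def is_group_alive(pgid: int) -> bool:
    """Probe a process group directly via killpg(2), bypassing kill(1) parsing."""
    try:
        # Signal 0 delivers nothing; it only performs the existence and
        # permission checks.
        os.killpg(pgid, 0)
    except ProcessLookupError:
        # ESRCH: no process in that group exists.
        return False
    except PermissionError:
        # EPERM: the group exists but we may not signal it.
        return True
    return True


if __name__ == "__main__":
    own_group = os.getpgrp()
    print(check_group_alive_command(own_group))
    print(is_group_alive(own_group))
```

The second function avoids the command-line syntax question entirely by calling the system call directly. A recovery path that must shell out would rely on the first form instead.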