hadoop-common-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-12317) Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST
Date Thu, 20 Aug 2015 12:47:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704801#comment-14704801 ]

Hudson commented on HADOOP-12317:
---------------------------------

ABORTED: Integrated in Hadoop-Hdfs-trunk #2220 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2220/])
HADOOP-12317. Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST (adhoot via rkanter) (rkanter: rev 1e06299df82b98795124fe8a33578c111e744ff4)
* hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java
* hadoop-common-project/hadoop-common/CHANGES.txt
* hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestShell.java


> Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-12317
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12317
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Anubhav Dhoot
>            Assignee: Anubhav Dhoot
>            Priority: Critical
>             Fix For: 2.8.0
>
>         Attachments: YARN-4046.002.patch, YARN-4046.002.patch, YARN-4096.001.patch
>
>
> On a Debian machine, we have seen NodeManager recovery of containers fail because the signal syntax used to address a process group may not work there. The check for whether a process is alive returns errors during container recovery, which causes the container to be declared LOST (exit code 154) on a NodeManager restart.
> The application then fails with the error below, and the attempt is not retried.
> {noformat}
> Application application_1439244348718_0001 failed 1 times due to Attempt recovered after RM restartAM Container for appattempt_1439244348718_0001_000001 exited with exitCode: 154
> {noformat}
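
For readers unfamiliar with the liveness check described above, the following is a minimal sketch of how a process-group check with `kill -0` can be built so that the negative ID is not mistaken for an option. It is an illustration only, not the patch committed for this issue: the class name `ProcessGroupCheck` and the methods `buildCheckAliveCommand` and `isAlive` are hypothetical, and the use of `--` to end option parsing is an assumption about what a portable form looks like.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not part of Hadoop: illustrates probing a process or
// process group with "kill -0", which checks existence without delivering a signal.
class ProcessGroupCheck {

    // Builds the probe command. For a process group the ID is passed as "-<pid>";
    // the "--" separator (an assumption of this sketch) ends option parsing so that
    // the negative ID is not interpreted as a signal option by stricter kill builds.
    static String[] buildCheckAliveCommand(String pid, boolean processGroup) {
        if (processGroup) {
            return new String[] {"kill", "-0", "--", "-" + pid};
        }
        return new String[] {"kill", "-0", pid};
    }

    // Runs the probe and reports whether it exited with status 0.
    static boolean isAlive(String pid, boolean processGroup)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCheckAliveCommand(pid, processGroup))
                .redirectErrorStream(true)
                .start();
        // Drain any output so the child cannot block on a full pipe.
        try (InputStream in = p.getInputStream()) {
            while (in.read() != -1) {
                // discard
            }
        }
        return p.waitFor() == 0;
    }

    public static void main(String[] args) throws Exception {
        String pid = args.length > 0 ? args[0] : "1";
        System.out.println(Arrays.toString(buildCheckAliveCommand(pid, true)));
        System.out.println(Arrays.toString(buildCheckAliveCommand(pid, false)));
        if (args.length > 0) {
            System.out.println("alive: " + isAlive(pid, false));
        }
    }
}
```

If a probe like this fails for reasons unrelated to the process state (for example, because the kill binary rejects the argument syntax), a recovering NodeManager could wrongly conclude that a live container is gone, which matches the symptom reported above.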



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
