hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-5103) With NM recovery enabled, restarting NM multiple times results in AM restart
Date Fri, 20 May 2016 15:44:12 GMT

     [ https://issues.apache.org/jira/browse/YARN-5103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junping Du updated YARN-5103:
-----------------------------
    Attachment: YARN-5103.patch

It looks difficult to add a unit test that covers this case: many objects would need to be
mocked, and RecoveredContainerLaunch's internal logic checks the pid path, which is not easy
to mock (we could change the logic there, but that would make the code rather convoluted).
I updated the patch slightly, given that the interrupted exception now gets wrapped as an
InterruptedIOException in HADOOP-12074.
[~jlowe], would you help review it? Thanks!
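
To illustrate why the InterruptedIOException wrapping matters for recovery, here is a
minimal sketch (not the attached patch). It assumes the liveness check surfaces an interrupt
as java.io.InterruptedIOException, so a caller can tell "NM is shutting down" apart from a
genuine I/O failure. The names InterruptAwareRecovery, LivenessProbe, Outcome and reacquire
are hypothetical and do not exist in Hadoop.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

// Hypothetical sketch: none of these types are part of the Hadoop code base.
class InterruptAwareRecovery {
    enum Outcome { ALIVE, LOST, INTERRUPTED }

    // Stand-in for a liveness check such as a container-executor signal call.
    interface LivenessProbe {
        boolean isAlive() throws IOException;
    }

    static Outcome reacquire(LivenessProbe probe) {
        try {
            return probe.isAlive() ? Outcome.ALIVE : Outcome.LOST;
        } catch (InterruptedIOException e) {
            // An interrupt is not evidence that the container died:
            // restore the interrupt flag and report it separately.
            Thread.currentThread().interrupt();
            return Outcome.INTERRUPTED;
        } catch (IOException e) {
            // Any other I/O failure is treated as a lost container in this sketch.
            return Outcome.LOST;
        }
    }

    public static void main(String[] args) {
        Outcome outcome = reacquire(() -> {
            throw new InterruptedIOException("interrupted while probing");
        });
        System.out.println(outcome);
        Thread.interrupted(); // clear the flag set by reacquire()
    }
}
```

The InterruptedIOException catch has to come before the IOException catch, because it is a
subclass. With a plain IOException that only carries "java.lang.InterruptedException" in its
message, as in the log quoted below, the two cases cannot be separated by type.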

> With NM recovery enabled, restarting NM multiple times results in AM restart
> ----------------------------------------------------------------------------
>
>                 Key: YARN-5103
>                 URL: https://issues.apache.org/jira/browse/YARN-5103
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Sumana Sathish
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-5103-demo.patch, YARN-5103.patch
>
>
> AM is restarted when NM is restarted multiple times even though NM recovery is enabled.
> {Code:title=NM log on which AM attempt 1 was running }
>  ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(88)) - Unable to recover container container_e12_1463043063682_0002_01_000001
> java.io.IOException: java.lang.InterruptedException
> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:579)
> 	at org.apache.hadoop.util.Shell.run(Shell.java:487)
> 	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
> 	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:478)
> 	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerProcessAlive(LinuxContainerExecutor.java:542)
> 	at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:185)
> 	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:445)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {Code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org