hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE
Date Wed, 02 Mar 2016 18:33:18 GMT

    [ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176186#comment-15176186 ]

Jason Lowe commented on YARN-4744:
----------------------------------

bq. Can we use a check similar to LinuxContainerExecutor#isContainerAlive(ContainerLivenessContext ctx)?

That function is implemented in terms of signalContainer (so we have the same issue), and
the process could exit between the check and the subsequent kill attempt.
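
To illustrate the race described above, here is a minimal sketch. It is not the NodeManager code: the class CheckThenKillRace, its in-memory "process table", and the methods start, exit, isAlive and signal are all hypothetical stand-ins. It only assumes what the comment states, namely that the liveness check goes through the same signalling path and that the process may exit between the check and the kill. The pid 9370 and signal 15 are taken from the command array in the quoted log.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of a check-then-kill race; not Hadoop code.
class CheckThenKillRace {
  // Stand-in for the OS process table (assumption for illustration).
  private final Set<Integer> livePids = ConcurrentHashMap.newKeySet();

  void start(int pid) { livePids.add(pid); }

  // The process exits on its own, independent of any signal we send.
  void exit(int pid) { livePids.remove(pid); }

  // Delivers a signal; returns false if the target no longer exists.
  // Signal 0 only probes for existence, any other signal terminates here.
  boolean signal(int pid, int sig) {
    if (!livePids.contains(pid)) {
      return false;
    }
    if (sig != 0) {
      livePids.remove(pid);
    }
    return true;
  }

  // Liveness check built on the signalling path, as the comment describes.
  boolean isAlive(int pid) { return signal(pid, 0); }

  static String demo() {
    CheckThenKillRace os = new CheckThenKillRace();
    os.start(9370);
    boolean alive = os.isAlive(9370);      // check succeeds
    os.exit(9370);                         // container exits in the window
    boolean killed = os.signal(9370, 15);  // kill attempt now fails
    return "alive=" + alive + ", killed=" + killed;
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints "alive=true, killed=false"
  }
}
```

Adding the check in front of the kill narrows the window but cannot close it, so the failure path still has to be handled.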

bq. My feeling is that the PrivilegedOperationExecutor should log failures irrespective of
the error code

There's always going to be a race where a container can exit before it gets killed, and I'm
not sure we accomplish much besides alarming users by logging warnings when that occurs.
IMHO PrivilegedOperationExecutor should not be the one that decides what should and shouldn't
be logged, since it doesn't have any context on whether the error is severe enough to warrant
it.  Instead I think we should ensure the same data is present in the PrivilegedOperationException
and let the code handling that error perform the logging if it is appropriate to do so.
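
A rough sketch of that split of responsibilities, under stated assumptions: OpException is a hypothetical stand-in for PrivilegedOperationException (its fields are illustrative, not the real class's API), signal stands in for the executor, and cleanup stands in for a caller such as the container cleanup path. The returned strings stand in for log calls so the example stays self-contained. Exit code 9 and the output text are taken from the quoted log; treating that failure as benign when the container has already exited is an assumption made for this example.

```java
// Hypothetical sketch: the executor reports, the caller decides what to log.
class CallerDecidesLogging {

  // Stand-in for PrivilegedOperationException: carries the data, logs nothing.
  static class OpException extends Exception {
    final int exitCode;
    final String output;

    OpException(int exitCode, String output) {
      super("exitCode=" + exitCode);
      this.exitCode = exitCode;
      this.output = output;
    }
  }

  // Stand-in for the executor: it has no context, so it only throws.
  static void signal(boolean processExists) throws OpException {
    if (!processExists) {
      throw new OpException(9, "main : command provided 2");
    }
  }

  // Stand-in for the caller, which knows whether the container had
  // already finished and can choose the log level accordingly.
  static String cleanup(boolean processExists, boolean containerAlreadyExited) {
    try {
      signal(processExists);
      return "OK";
    } catch (OpException e) {
      if (containerAlreadyExited) {
        // Expected race: the container finished before the signal arrived.
        return "DEBUG: container already exited, " + e.getMessage();
      }
      // Unexpected failure: surface everything the exception carries.
      return "WARN: signal failed, " + e.getMessage() + ", output: " + e.output;
    }
  }

  public static void main(String[] args) {
    System.out.println(cleanup(false, true));
    System.out.println(cleanup(false, false));
  }
}
```

Because the exception carries the exit code and command output, no diagnostic data is lost by moving the logging decision to the caller.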


> Too many signal to container failure in case of LCE
> ---------------------------------------------------
>
>                 Key: YARN-4744
>                 URL: https://issues.apache.org/jira/browse/YARN-4744
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.9.0
>            Reporter: Bibin A Chundatt
>            Assignee: Sidharta Seethana
>
> Install an HA cluster in secure mode
> Enable LCE with cgroups
> Start the server as the dsperf user
> Submit a mapreduce application (terasort/teragen) as user yarn/dsperf
> Too many signal-to-container failures occur
> When submitting as that user, the following exception is thrown:
> {noformat}
> 2014-03-02 09:20:38,689 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
Authorization successful for testing (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
> 2014-03-02 09:20:40,158 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
Event EventType: KILL_CONTAINER sent to absent container container_e02_1393731146548_0001_01_000013
> 2014-03-02 09:20:43,071 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
Container container_e02_1393731146548_0001_01_000009 succeeded
> 2014-03-02 09:20:43,072 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
Container container_e02_1393731146548_0001_01_000009 transitioned from RUNNING to EXITED_WITH_SUCCESS
> 2014-03-02 09:20:43,073 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
Cleaning up container container_e02_1393731146548_0001_01_000009
> 2014-03-02 09:20:43,075 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime:
Using container runtime: DefaultLinuxContainerRuntime
> 2014-03-02 09:20:43,081 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
Shell execution returned exit code: 9. Privileged Execution Operation Output:
> main : command provided 2
> main : run as user is yarn
> main : requested yarn user is yarn
> Full command array for failed execution:
> [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, yarn,
yarn, 2, 9370, 15]
> 2014-03-02 09:20:43,081 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime:
Signal container failed. Exception:
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
ExitCodeException exitCode=9:
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: ExitCodeException exitCode=9:
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:927)
>         at org.apache.hadoop.util.Shell.run(Shell.java:838)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150)
>         ... 9 more
> 2014-03-02 09:20:43,113 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger:
USER=yarn OPERATION=Container Finished - Succeeded        TARGET=ContainerImpl    RESULT=SUCCESS
 APPID=application_1393731146548_0001    CONTAINERID=container_e02_1393731146548_0001_01_000009
> 2014-03-02 09:20:43,115 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
Container container_e02_1393731146548_0001_01_000009 transitioned from EXITED_WITH_SUCCESS
to DONE
> 2014-03-02 09:20:43,115 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
Removing container_e02_1393731146548_0001_01_000009 from application application_1393731146548_0001
> {noformat}
> Checked the same scenario in 2.7.2 version (not available)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
