hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6078) Containers stuck in Localizing state
Date Thu, 09 Nov 2017 22:53:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-6078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246713#comment-16246713 ]

Junping Du commented on YARN-6078:

Thanks [~billie.rinaldi] for updating the patch! A quick question:
bq. The new patch only propagates the interrupt when a shell hasn't successfully been destroyed.
What is the impact of {{super.interrupt();}} when the shell process does get destroyed? As you
said, it may prevent the rest of the cleanup for process destruction from being performed. Would
there be any side effects if we skipped it entirely?

Other than this question, the patch looks good to me.
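
To make the behaviour under discussion concrete, here is a minimal sketch of a thread that propagates an interrupt only when its child shell could not be destroyed. It is not the YARN-6078 patch: {{CleanupAwareRunner}}, the {{destroyShell}} hook and the {{bodySawInterrupt}} helper are hypothetical names for illustration, and a latch stands in for the blocking wait on the child process.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for a runner thread; not the YARN LocalizerRunner. */
public class CleanupAwareRunner extends Thread {
    // Assumed hook: tries to destroy the child shell and returns true on success.
    private final BooleanSupplier destroyShell;

    public CleanupAwareRunner(Runnable body, BooleanSupplier destroyShell) {
        super(body);
        this.destroyShell = destroyShell;
    }

    @Override
    public void interrupt() {
        boolean destroyed = destroyShell.getAsBoolean();
        if (!destroyed) {
            // Fall back to a real interrupt only when the child could not be destroyed.
            super.interrupt();
        }
    }

    /** Demo helper: reports whether the thread body observed an interrupt. */
    public static boolean bodySawInterrupt(boolean shellDestroyed) {
        AtomicBoolean sawInterrupt = new AtomicBoolean(false);
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        CleanupAwareRunner runner = new CleanupAwareRunner(() -> {
            started.countDown();
            try {
                release.await(); // stands in for the blocking wait on the child
            } catch (InterruptedException e) {
                sawInterrupt.set(true);
            }
        }, () -> shellDestroyed);
        runner.start();
        try {
            started.await();
            runner.interrupt();
            if (shellDestroyed) {
                release.countDown(); // models the wait ending once the child is gone
            }
            runner.join(TimeUnit.SECONDS.toMillis(5));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sawInterrupt.get();
    }
}
```

In this sketch, {{bodySawInterrupt(true)}} models the case asked about: the child is destroyed, the wait ends by itself, and the body never sees an {{InterruptedException}}, so any cleanup after the wait runs normally.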

> Containers stuck in Localizing state
> ------------------------------------
>                 Key: YARN-6078
>                 URL: https://issues.apache.org/jira/browse/YARN-6078
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jagadish
>            Assignee: Billie Rinaldi
>         Attachments: YARN-6078.001.patch, YARN-6078.002.patch
> I encountered an interesting issue in one of our YARN clusters: containers are stuck in the
localizing phase.
> Our AM requests a container, and starts a process using the NMClient.
> According to the NM the container is in LOCALIZING state:
> {code}
> 2017-01-09 22:06:18,362 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135)
- Container container_e03_1481261762048_0541_02_000060 transitioned from NEW to LOCALIZING
> 2017-01-09 22:06:18,363 [INFO] [AsyncDispatcher event handler] localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:711)
- Created localizer for container_e03_1481261762048_0541_02_000060
> 2017-01-09 22:06:18,364 [INFO] [LocalizerRunner for container_e03_1481261762048_0541_02_000060]
- Writing credentials to the nmPrivate file /../..//.nmPrivate/container_e03_1481261762048_0541_02_000060.tokens.
Credentials list:
> {code}
> According to the RM the container is in RUNNING state:
> {code}
> 2017-01-09 22:06:17,110 [INFO] [IPC Server handler 19 on 8030] rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:410)
- container_e03_1481261762048_0541_02_000060 Container Transitioned from ALLOCATED to ACQUIRED
> 2017-01-09 22:06:19,084 [INFO] [ResourceManager Event Processor] rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:410)
- container_e03_1481261762048_0541_02_000060 Container Transitioned from ACQUIRED to RUNNING
> {code}
> When I open the YARN RM UI to view the logs for the container, I get the following error:
> {code}
> No logs were found. state is LOCALIZING
> {code}
> The NodeManager's stack trace seems to indicate that the NM's LocalizerRunner is stuck
waiting to read from the sub-process's output stream.
> {code}
> "LocalizerRunner for container_e03_1481261762048_0541_02_000060" #27007081 prio=5 os_prio=0
tid=0x00007fa518849800 nid=0x15f7 runnable [0x00007fa5076c3000]
>    java.lang.Thread.State: RUNNABLE
> 	at java.io.FileInputStream.readBytes(Native Method)
> 	at java.io.FileInputStream.read(FileInputStream.java:255)
> 	at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
> 	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> 	- locked <0x00000000c6dc9c50> (a java.lang.UNIXProcess$ProcessPipeInputStream)
> 	at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
> 	at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
> 	at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
> 	- locked <0x00000000c6dc9c78> (a java.io.InputStreamReader)
> 	at java.io.InputStreamReader.read(InputStreamReader.java:184)
> 	at java.io.BufferedReader.fill(BufferedReader.java:161)
> 	at java.io.BufferedReader.read1(BufferedReader.java:212)
> 	at java.io.BufferedReader.read(BufferedReader.java:286)
> 	- locked <0x00000000c6dc9c78> (a java.io.InputStreamReader)
> 	at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:786)
> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:568)
> 	at org.apache.hadoop.util.Shell.run(Shell.java:479)
> 	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
> 	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:237)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1113)
> {code}
> I ran {{ps aux}} and confirmed that no container-executor process with INITIALIZE_CONTAINER
(the one the localizer starts) was running. It seems that the output stream pipe of the process
is still not closed, even though the localizer process is no longer present.
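
The stack trace above shows a read on the child's output that never reaches end-of-stream. As a hedged illustration only (it is not taken from Hadoop's {{Shell}} class or from any YARN-6078 patch), the sketch below reads a stream on a helper thread so that the caller can stop waiting after a timeout. {{BoundedOutputReader}} and {{readAll}} are hypothetical names, and the timeout value is up to the caller.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper; not part of Hadoop. */
public class BoundedOutputReader {
    /**
     * Reads the stream to end-of-stream on a helper thread.
     * Returns null if end-of-stream is not seen within timeoutMs.
     */
    public static String readAll(InputStream in, long timeoutMs) {
        // Daemon thread, so a read that never returns cannot keep the JVM alive.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "output-reader");
            t.setDaemon(true);
            return t;
        });
        Future<String> result = pool.submit(() -> {
            StringBuilder sb = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8))) {
                char[] buf = new char[512];
                int n;
                while ((n = reader.read(buf)) != -1) {
                    sb.append(buf, 0, n);
                }
            }
            return sb.toString();
        });
        try {
            return result.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) interrupts the helper; a read blocked on a real process
            // pipe may not react to that, so the helper can linger until the pipe closes.
            result.cancel(true);
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            result.cancel(true);
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException("reading child output failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With a real child process, the argument would be {{process.getInputStream()}}. The timeout only frees the calling thread; it does not close the pipe, so the underlying cause of the missing end-of-stream still has to be addressed.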
