hadoop-yarn-issues mailing list archives

From "Prabhu Joseph (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6078) Containers stuck in Localizing state
Date Thu, 02 Nov 2017 07:19:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-6078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235308#comment-16235308 ]

Prabhu Joseph commented on YARN-6078:
-------------------------------------

We have hit this issue recently. Below is our analysis.

When the NodeManager is overloaded and ContainerLocalizer processes hang, the containers
time out and are cleaned up. The LocalizerRunner thread is interrupted during cleanup, but the
interrupt has no effect while the thread is blocked reading from a FileInputStream. LocalizerRunner
threads and ContainerLocalizer processes therefore keep accumulating, which eventually makes the
node completely unresponsive.
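For reference, below is a minimal standalone sketch that reproduces the behaviour (it assumes a
Unix-like system with a /bin/sleep binary; the class name InterruptedReadDemo is just illustrative):
a thread blocked in a stream read on a process pipe does not wake up on Thread.interrupt(); it only
returns once the child process is killed and the pipe closes.

{code}
import java.io.IOException;
import java.io.InputStream;

public class InterruptedReadDemo {
    public static void main(String[] args) throws Exception {
        // Child process that produces no output, so the read below blocks.
        Process p = new ProcessBuilder("sleep", "60").start();
        Thread reader = new Thread(() -> {
            try (InputStream in = p.getInputStream()) {
                System.out.println("read returned: " + in.read()); // blocks here
            } catch (IOException e) {
                System.out.println("read failed: " + e);
            }
        });
        reader.start();
        Thread.sleep(1000);
        reader.interrupt();  // has no effect on the blocking stream read
        reader.join(5000);
        System.out.println("reader alive after interrupt: " + reader.isAlive());
        p.destroy();         // only killing the child unblocks the reader
    }
}
{code}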


The following options would help to avoid this:

1. ShellCommandExecutor#parseExecResult currently uses a blocking read(); it could instead poll
with the non-blocking available() plus a short sleep, so that the loop can notice an interrupt or
a stop flag between reads:

{code}
// Poll with available() instead of a blocking read() so the loop can
// observe interrupts or a stop flag between reads.
byte[] buffer = new byte[8192];
while (running) {
    if (in.available() > 0) {
        int n = in.read(buffer);
        // process the n bytes read into buffer
    } else {
        Thread.sleep(500);
    }
}
{code}

2. Add a timeout for the shell command, similar to HADOOP-13817; the timeout value could be set by
the AM, the same as the container timeout (a rough sketch follows below).
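
This is not the actual Hadoop change, but as a rough sketch of the approach (the TimedShell class
and runWithTimeout method are made-up names): enforce a timeout on the shell command and kill the
child process when it expires, which also closes the pipe and releases the thread blocked on the read.

{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedShell {
    /** Runs the command and kills it if it does not finish within timeoutMs. */
    public static int runWithTimeout(long timeoutMs, String... command)
            throws Exception {
        // timeoutMs would come from the AM, e.g. the same value as the container timeout.
        Process p = new ProcessBuilder(command).inheritIO().start();
        if (!p.waitFor(timeoutMs, TimeUnit.MILLISECONDS)) {
            p.destroyForcibly();  // kill the hung localizer process
            throw new TimeoutException("command timed out after " + timeoutMs + " ms");
        }
        return p.exitValue();
    }
}
{code}

Killing the child closes its end of the pipe, so the blocked LocalizerRunner read returns instead of
hanging forever.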


ContainerLocalizer JVM stack trace:

{code}
"main" #1 prio=5 os_prio=0 tid=0x00007fd8ec019000 nid=0xc295 runnable [0x00007fd8f3956000]
   java.lang.Thread.State: RUNNABLE
	at java.util.zip.ZipFile.open(Native Method)
	at java.util.zip.ZipFile.<init>(ZipFile.java:219)
	at java.util.zip.ZipFile.<init>(ZipFile.java:149)
	at java.util.jar.JarFile.<init>(JarFile.java:166)
	at java.util.jar.JarFile.<init>(JarFile.java:103)
	at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:893)
	at sun.misc.URLClassPath$JarLoader.access$700(URLClassPath.java:756)
	at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:838)
	at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:831)
	at java.security.AccessController.doPrivileged(Native Method)
	at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:830)
	at sun.misc.URLClassPath$JarLoader.<init>(URLClassPath.java:803)
	at sun.misc.URLClassPath$3.run(URLClassPath.java:530)
	at sun.misc.URLClassPath$3.run(URLClassPath.java:520)
	at java.security.AccessController.doPrivileged(Native Method)
	at sun.misc.URLClassPath.getLoader(URLClassPath.java:519)
	at sun.misc.URLClassPath.getLoader(URLClassPath.java:492)
	- locked <0x000000076ac75058> (a sun.misc.URLClassPath)
	at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:457)
	- locked <0x000000076ac75058> (a sun.misc.URLClassPath)
	at sun.misc.URLClassPath.getResource(URLClassPath.java:211)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:365)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	- locked <0x000000076ac7f960> (a java.lang.Object)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:495)
{code}

NodeManager LocalizerRunner thread that does not respond to the interrupt:

{code}
"LocalizerRunner for container_e746_1508665985104_601806_01_000005" #3932753 prio=5 os_prio=0
tid=0x00007fb258d5f800 nid=0x11091 runnable [0x00007fb153946000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:255)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        - locked <0x0000000718502b80> (a java.lang.UNIXProcess$ProcessPipeInputStream)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        - locked <0x0000000718502bd8> (a java.io.InputStreamReader)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.read1(BufferedReader.java:212)
        at java.io.BufferedReader.read(BufferedReader.java:286)
        - locked <0x0000000718502bd8> (a java.io.InputStreamReader)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1155)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:930)
        at org.apache.hadoop.util.Shell.run(Shell.java:848)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1142)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:151)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:264)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
{code}

NM log shows the LocalizerRunner is supposed to

> Containers stuck in Localizing state
> ------------------------------------
>
>                 Key: YARN-6078
>                 URL: https://issues.apache.org/jira/browse/YARN-6078
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jagadish
>            Priority: Major
>
> I encountered an interesting issue in one of our Yarn clusters (where the containers are stuck in localizing phase).
> Our AM requests a container, and starts a process using the NMClient.
> According to the NM the container is in LOCALIZING state:
> {code}
> 2017-01-09 22:06:18,362 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_e03_1481261762048_0541_02_000060 transitioned from NEW to LOCALIZING
> 2017-01-09 22:06:18,363 [INFO] [AsyncDispatcher event handler] localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:711) - Created localizer for container_e03_1481261762048_0541_02_000060
> 2017-01-09 22:06:18,364 [INFO] [LocalizerRunner for container_e03_1481261762048_0541_02_000060] localizer.ResourceLocalizationService$LocalizerRunner.writeCredentials(ResourceLocalizationService.java:1191) - Writing credentials to the nmPrivate file /../..//.nmPrivate/container_e03_1481261762048_0541_02_000060.tokens. Credentials list:
> {code}
> According to the RM the container is in RUNNING state:
> {code}
> 2017-01-09 22:06:17,110 [INFO] [IPC Server handler 19 on 8030] rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:410) - container_e03_1481261762048_0541_02_000060 Container Transitioned from ALLOCATED to ACQUIRED
> 2017-01-09 22:06:19,084 [INFO] [ResourceManager Event Processor] rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:410) - container_e03_1481261762048_0541_02_000060 Container Transitioned from ACQUIRED to RUNNING
> {code}
> When I click the Yarn RM UI to view the logs for the container, I get the following error:
> {code}
> No logs were found. state is LOCALIZING
> {code}
> The NodeManager's stack trace seems to indicate that the NM's LocalizerRunner is stuck waiting to read from the sub-process's output stream.
> {code}
> "LocalizerRunner for container_e03_1481261762048_0541_02_000060" #27007081 prio=5 os_prio=0 tid=0x00007fa518849800 nid=0x15f7 runnable [0x00007fa5076c3000]
>    java.lang.Thread.State: RUNNABLE
> 	at java.io.FileInputStream.readBytes(Native Method)
> 	at java.io.FileInputStream.read(FileInputStream.java:255)
> 	at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
> 	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> 	- locked <0x00000000c6dc9c50> (a java.lang.UNIXProcess$ProcessPipeInputStream)
> 	at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
> 	at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
> 	at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
> 	- locked <0x00000000c6dc9c78> (a java.io.InputStreamReader)
> 	at java.io.InputStreamReader.read(InputStreamReader.java:184)
> 	at java.io.BufferedReader.fill(BufferedReader.java:161)
> 	at java.io.BufferedReader.read1(BufferedReader.java:212)
> 	at java.io.BufferedReader.read(BufferedReader.java:286)
> 	- locked <0x00000000c6dc9c78> (a java.io.InputStreamReader)
> 	at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:786)
> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:568)
> 	at org.apache.hadoop.util.Shell.run(Shell.java:479)
> 	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
> 	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:237)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1113)
> {code}
> I did a {code}ps aux{code} and confirmed that there was no container-executor process running with INITIALIZE_CONTAINER that the localizer starts. It seems that the output stream pipe of the process is still not closed (even though the localizer process is no longer present).


