hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-10622) Shell.runCommand can deadlock
Date Tue, 20 May 2014 21:21:38 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003987#comment-14003987 ]

Jason Lowe commented on HADOOP-10622:
-------------------------------------

Saw this while running the TestNodeManagerResync.testKillContainersOnResync unit test, although
the nature of the deadlock looks like it could happen in other scenarios as well.

{noformat}
Found one Java-level deadlock:
=============================
"Thread-163":
  waiting to lock monitor 0x00007f4e38086b60 (object 0x00000000ebab1508, a java.lang.UNIXProcess$ProcessPipeInputStream),
  which is held by "LocalizerRunner for container_0_0000_01_000000"
"LocalizerRunner for container_0_0000_01_000000":
  waiting to lock monitor 0x00007f4e380855b8 (object 0x00000000ebab3620, a java.io.InputStreamReader),
  which is held by "Thread-163"

Java stack information for the threads listed above:
===================================================
"Thread-163":
	at java.io.BufferedInputStream.read(BufferedInputStream.java:325)
	- waiting to lock <0x00000000ebab1508> (a java.lang.UNIXProcess$ProcessPipeInputStream)
	at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
	at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
	at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
	- locked <0x00000000ebab3620> (a java.io.InputStreamReader)
	at java.io.InputStreamReader.read(InputStreamReader.java:184)
	at java.io.BufferedReader.fill(BufferedReader.java:154)
	at java.io.BufferedReader.readLine(BufferedReader.java:317)
	- locked <0x00000000ebab3620> (a java.io.InputStreamReader)
	at java.io.BufferedReader.readLine(BufferedReader.java:382)
	at org.apache.hadoop.util.Shell$1.run(Shell.java:506)
"LocalizerRunner for container_0_0000_01_000000":
	at java.io.BufferedReader.close(BufferedReader.java:515)
	- waiting to lock <0x00000000ebab3620> (a java.io.InputStreamReader)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:574)
	- locked <0x00000000ebab1508> (a java.lang.UNIXProcess$ProcessPipeInputStream)
	at org.apache.hadoop.util.Shell.run(Shell.java:452)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:684)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:773)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:756)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:288)
	at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1012)
	at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
	at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:351)
	at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
	at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:666)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:662)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.create(FileContext.java:662)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.writeCredentials(ResourceLocalizationService.java:1105)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1068)

Found 1 deadlock.
{noformat}

Shell.runCommand holds the lock on the stderr InputStream and is trying to close() the
BufferedReader that wraps it, while the errThread it spawned earlier in the method holds the
reader's lock and is trying to read from the same stream. The method tries to join with the
errThread before closing, but it appears the join was aborted by an InterruptedException in
the case where it deadlocked (probably because the container was being killed in the unit
test). Here's the relevant snippet from the unit test log showing the method being interrupted:

{noformat}
2014-05-20 20:48:40,053 INFO  [Thread-162] nodemanager.NodeManager (NodeManager.java:run(262))
- Cleaning up running containers on resync
2014-05-20 20:48:40,053 INFO  [Thread-162] containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanupContainersOnNMResync(376))
- Containers still running on ON_NODEMANAGER_RESYNC : [container_0_0000_01_000000]
2014-05-20 20:48:40,053 INFO  [Thread-162] containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanupContainersOnNMResync(383))
- Waiting for containers to be killed
2014-05-20 20:48:40,054 INFO  [AsyncDispatcher event handler] container.Container (ContainerImpl.java:handle(901))
- Container container_0_0000_01_000000 transitioned from LOCALIZING to KILLING
2014-05-20 20:48:40,057 WARN  [LocalizerRunner for container_0_0000_01_000000] util.Shell
(Shell.java:runCommand(533)) - Interrupted while reading the error stream
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1260)
	at java.lang.Thread.join(Thread.java:1334)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:531)
	at org.apache.hadoop.util.Shell.run(Shell.java:452)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:684)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:773)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:756)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:288)
	at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1012)
	at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
	at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:351)
	at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
	at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:666)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:662)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.create(FileContext.java:662)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.writeCredentials(ResourceLocalizationService.java:1105)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1068)
{noformat}
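
The aborted join can be reproduced in isolation. This is a minimal sketch, not the Hadoop code: the class name, method name, and thread names are hypothetical stand-ins. It shows Thread.join() throwing InterruptedException while the joined thread is still alive, which is the state runCommand appears to be in when it falls through to the finally block.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical demo class; not part of Hadoop.
final class InterruptedJoinDemo {

    /** Returns true if join() was aborted while the joined thread was still alive. */
    static boolean joinAbortedWhileWorkerAlive() {
        CountDownLatch release = new CountDownLatch(1);

        // Stand-in for errThread: stays blocked until released,
        // much like a reader waiting on a pipe that has no data yet.
        Thread worker = new Thread(() -> {
            try {
                release.await();
            } catch (InterruptedException ignored) {
                // not expected in this sketch
            }
        }, "errThread-stand-in");
        worker.start();

        boolean[] abortedWhileAlive = {false};

        // Stand-in for the thread running runCommand: joins the worker and gets interrupted.
        Thread joiner = new Thread(() -> {
            try {
                worker.join();
            } catch (InterruptedException e) {
                // The join gave up, but the worker has not finished.
                abortedWhileAlive[0] = worker.isAlive();
            }
        }, "runCommand-stand-in");
        joiner.start();
        joiner.interrupt(); // join() throws whether the interrupt lands before or during the wait

        try {
            joiner.join();       // also makes the joiner's write to the array visible here
            release.countDown(); // let the worker finish only after the joiner has given up
            worker.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException("demo thread was interrupted unexpectedly", e);
        }
        return abortedWhileAlive[0];
    }

    public static void main(String[] args) {
        System.out.println(joinAbortedWhileWorkerAlive());
    }
}
```

Any cleanup that assumes the worker is done once join() returns or throws is unsafe on the exception path, because the worker may still be holding locks on the stream it is reading.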

It looks like we need to either be a little more persistent in trying to join with the errThread
before entering the finally block where we lock and try to close the input stream, or we need
to rethink the locking scheme that was added in HADOOP-10146.
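
As a rough illustration of the first option, here is a sketch of a join that keeps retrying after an interrupt and restores the interrupt status afterwards. This is not a proposed patch: PersistentJoin and joinUninterruptibly are hypothetical names, and the sketch assumes the joined thread will eventually terminate (for example because the child process has exited or been destroyed). If that assumption does not hold, the loop would block indefinitely.

```java
// Hypothetical helper; not part of Hadoop.
final class PersistentJoin {

    /**
     * Waits for t to terminate even if the calling thread is interrupted,
     * then re-asserts the interrupt so callers can still observe it.
     */
    static void joinUninterruptibly(Thread t) {
        boolean interrupted = false;
        try {
            while (true) {
                try {
                    t.join();
                    return; // t has terminated; it no longer holds any monitors
                } catch (InterruptedException e) {
                    interrupted = true; // remember the interrupt and keep waiting
                }
            }
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException ignored) {
                // not expected in this sketch
            }
        });
        worker.start();

        Thread.currentThread().interrupt(); // simulate the kill arriving before the join
        joinUninterruptibly(worker);

        System.out.println("worker alive: " + worker.isAlive());
        System.out.println("interrupt preserved: " + Thread.interrupted());
    }
}
```

With a join like this, the finally block would only be reached after the reader thread has exited, so the close() could not contend with an in-progress read. Whether blocking through an interrupt is acceptable in runCommand is a separate question.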

> Shell.runCommand can deadlock
> -----------------------------
>
>                 Key: HADOOP-10622
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10622
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Priority: Critical
>
> Ran into a deadlock in Shell.runCommand.  Stacktrace details to follow.



--
This message was sent by Atlassian JIRA
(v6.2#6252)