hadoop-yarn-issues mailing list archives

From "Eric Badger (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5641) Localizer leaves behind tarballs after container is complete
Date Fri, 16 Sep 2016 13:57:21 GMT

    [ https://issues.apache.org/jira/browse/YARN-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15496398#comment-15496398 ]

Eric Badger commented on YARN-5641:
-----------------------------------

[~jlowe] and I worked on this for some time yesterday, and killing the spawned untar shell
process is proving to be very difficult. The localizer spawns the untar thread, which
runs the untar command through a shell exec. Once the container is killed, the next time the
localizer heartbeats to the NM, it is instructed to die. Inside the 'die' codepath, the
localizer interrupts all of its spawned threads using the cancel() method. However, the untar
thread is blocked in file I/O, waiting to parse the result of the shell execution, and that
read is uninterruptible. The untar thread won't get the InterruptedException until the read
finishes, so we cannot kill it or the untar shell exec before the command completes. We could
have the localizer process wait for the untar thread to end via awaitTermination() (currently
it only uses shutdownNow()), but that won't return until untar finishes on its own, since the
shutdown's interrupt has no effect on the untar thread.

I tested this by replacing the untar shell command with a sleep command so that there would
be no chance of the untar actually finishing. The container was killed and the localizer was
instructed to die after the subsequent NM heartbeat. It then attempted to shut down all of its
threads, but the untar thread sat in readBytes instead of getting the InterruptedException.
Below is the stack trace of the untar thread just after the localizer calls shutdown(). It
never gets the InterruptedException and sits in this stack trace until awaitTermination hits
its timeout and the localizer kills the JVM. Since we never catch the InterruptedException, we
are unable to destroy the untar shell process, and it continues to run after the localizer
and untar thread are killed (it is reparented to init).

{noformat}
"ContainerLocalizer Downloader" #19 prio=5 os_prio=0 tid=0x00007f4315169800 nid=0x1530 runnable [0x00007f42f5217000]
   java.lang.Thread.State: RUNNABLE
	at java.io.FileInputStream.readBytes(Native Method)
	at java.io.FileInputStream.read(FileInputStream.java:255)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	- locked <0x000000076f4fca28> (a java.lang.UNIXProcess$ProcessPipeInputStream)
	at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
	at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
	at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
	- locked <0x000000076f506cf8> (a java.io.InputStreamReader)
	at java.io.InputStreamReader.read(InputStreamReader.java:184)
	at java.io.BufferedReader.fill(BufferedReader.java:161)
	at java.io.BufferedReader.read1(BufferedReader.java:212)
	at java.io.BufferedReader.read(BufferedReader.java:286)
	- locked <0x000000076f506cf8> (a java.io.InputStreamReader)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:786)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:568)
	at org.apache.hadoop.util.Shell.run(Shell.java:479)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
	at org.apache.hadoop.fs.FileUtil.unTarUsingTar(FileUtil.java:682)
	at org.apache.hadoop.fs.FileUtil.unTar(FileUtil.java:651)
	at org.apache.hadoop.yarn.util.FSDownload.unpack(FSDownload.java:283)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:364)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{noformat}
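
For anyone who wants to reproduce the behaviour outside of Hadoop, here is a minimal sketch using only the JDK. It is not the localizer or Shell code: the class name BlockedReadDemo, the timeouts, and the use of a POSIX sleep binary as a stand-in for untar are all assumptions made for illustration. A worker thread blocks reading the child's stdout, the pool is shut down with shutdownNow(), and only destroying the child process is expected to let the worker finish.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of Hadoop.
public class BlockedReadDemo {

    /** Returns {terminatedAfterShutdownNow, terminatedAfterDestroy}. */
    static boolean[] run() throws Exception {
        // Stand-in for the untar command: writes nothing and runs for a while.
        // Assumes a POSIX `sleep` binary is on the PATH.
        Process child = new ProcessBuilder("sleep", "30").start();

        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(() -> {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(child.getInputStream(), StandardCharsets.UTF_8))) {
                // Blocks in a native read until the child writes or its stdout closes.
                while (reader.readLine() != null) {
                    // Discard output; only the blocking behaviour matters here.
                }
            } catch (IOException e) {
                // The stream may be closed underneath the reader when the child is destroyed.
            }
        });

        // Give the worker time to enter the blocking read.
        Thread.sleep(500);

        // Interrupts the worker, but a plain stream read does not react to interrupts.
        pool.shutdownNow();
        boolean afterShutdownNow = pool.awaitTermination(2, TimeUnit.SECONDS);

        // Killing the child closes its end of the pipe, which ends the read.
        child.destroy();
        boolean afterDestroy = pool.awaitTermination(5, TimeUnit.SECONDS);

        return new boolean[] { afterShutdownNow, afterDestroy };
    }

    public static void main(String[] args) throws Exception {
        boolean[] result = run();
        System.out.println("terminated after shutdownNow(): " + result[0]);
        System.out.println("terminated after destroy():     " + result[1]);
    }
}
```

The point of the sketch is that whoever wants to stop the worker needs a handle on the child process itself, because interrupting the reading thread alone does not reach it.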

> Localizer leaves behind tarballs after container is complete
> ------------------------------------------------------------
>
>                 Key: YARN-5641
>                 URL: https://issues.apache.org/jira/browse/YARN-5641
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>
> The localizer sometimes fails to clean up extracted tarballs, leaving large footprints
> that persist on the nodes indefinitely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org
