spark-issues mailing list archives

From "Apache Spark (JIRA)" <>
Subject [jira] [Assigned] (SPARK-11655) SparkLauncherBackendSuite leaks child processes
Date Wed, 11 Nov 2015 18:27:11 GMT


Apache Spark reassigned SPARK-11655:

    Assignee: Apache Spark

> SparkLauncherBackendSuite leaks child processes
> -----------------------------------------------
>                 Key: SPARK-11655
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 1.6.0
>            Reporter: Josh Rosen
>            Assignee: Apache Spark
>            Priority: Blocker
>         Attachments: month_of_doom.png, screenshot-1.png, year_or_doom.png
> We've been combatting an orphaned process issue on AMPLab Jenkins since October and I
finally was able to dig in and figure out what's going on.
> After some sleuthing and working around OS limits and JDK bugs, I was able to get the
full launch commands for the hanging orphaned processes. It looks like they're all running
> {code}
> org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
-Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
> {code}
> Based on the output of some Ganglia graphs, I was able to figure out that these leaks
started around October 9.
>  !screenshot-1.png|thumbnail! 
> This roughly lines up with when was merged,
which added LauncherBackendSuite. The launch arguments used in this suite seem to line up
with the arguments that I observe in the hanging processes' {{jps}} output:
> Interestingly, Jenkins doesn't show test timing or output for this suite! I think that
what might be happening is that we have a mixed Scala/Java package, so maybe the two test
runner XML files aren't being merged properly:,label=spark-test/testReport/org.apache.spark.launcher/
> Whenever I try running this suite locally, it ends up leaving behind an orphaned
SparkSubmit process! I think that what's happening is that the launcher's {{handle.kill()}}
call destroys only the bash {{spark-submit}} subprocess, so its child process (a
JVM) keeps running and leaks.
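
The leak mechanism described above can be reproduced outside Spark. The following is a minimal POSIX-only sketch, not the launcher's actual code: a wrapper shell stands in for the bash {{spark-submit}} script, {{sleep}} stands in for the JVM, and the function name `grandchild_survives_kill` is hypothetical. It kills only the direct child and then checks whether the grandchild is still running.

```python
import os
import signal
import subprocess


def grandchild_survives_kill():
    """Kill a wrapper shell and report whether its own child is still alive."""
    # The shell stands in for the bash spark-submit script; `sleep` stands in
    # for the JVM it launches. The shell prints the sleeper's PID, then waits.
    shell = subprocess.Popen(
        ["sh", "-c", "sleep 30 & echo $!; wait"],
        stdout=subprocess.PIPE,
        text=True,
    )
    grandchild_pid = int(shell.stdout.readline())

    # Terminate only the direct child, as a plain process-handle kill would.
    shell.kill()
    shell.wait()

    try:
        os.kill(grandchild_pid, 0)  # signal 0 only checks that the PID exists
        survived = True
    except ProcessLookupError:
        survived = False
    else:
        # Clean up the orphan so the sketch itself does not leak a process.
        os.kill(grandchild_pid, signal.SIGKILL)
    finally:
        shell.stdout.close()
    return survived
```

On a typical Linux or macOS system the function should report that the grandchild outlives the shell, because killing a process does not by itself signal that process's children.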
> I think that we'll have to do something similar to what we do in PySpark when launching
a child JVM from a Python / Bash process: connect it to a socket or stream such that it can
detect its parent's death and clean up after itself appropriately.
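
One way to sketch the "detect the parent's death through a stream" idea is to have the child watch its standard input and exit on end-of-file. This is an illustrative assumption about the general technique, not the fix that was applied to Spark; `CHILD_SOURCE`, `launch_child`, and `child_exits_when_pipe_closes` are hypothetical names, and the child here is a small Python program rather than a JVM.

```python
import subprocess
import sys
import textwrap

# Hypothetical child program: a background thread blocks reading stdin and
# exits the whole process as soon as the pipe reaches end-of-file.
CHILD_SOURCE = textwrap.dedent("""
    import os, sys, threading, time

    def watch_parent():
        sys.stdin.buffer.read()   # returns only once the pipe is closed
        os._exit(0)               # parent is gone: shut down immediately

    threading.Thread(target=watch_parent, daemon=True).start()
    time.sleep(60)                # stand-in for the child's real work
    os._exit(1)                   # reached only if EOF was never seen
""")


def launch_child():
    """Start the child with its stdin connected to a pipe held by the parent."""
    return subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE,
    )


def child_exits_when_pipe_closes(timeout=10.0):
    """Close the parent's end of the pipe and check that the child exits."""
    child = launch_child()
    # Closing the write end simulates parent death: when a process exits,
    # the OS closes its file descriptors, which produces the same EOF.
    child.stdin.close()
    try:
        return child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        child.kill()
        child.wait()
        return False
```

A design note under the same assumptions: the child sees end-of-file only once every write end of the pipe is closed, so when an intermediate shell sits between the launcher and the final process, the pipe has to be inherited through that shell for the signal to reach the process that needs it.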
> /cc [~shaneknapp] and [~vanzin].
