spark-issues mailing list archives

From "Sean Owen (JIRA)" <>
Subject [jira] [Commented] (SPARK-5782) Python Worker / Pyspark Daemon Memory Issue
Date Fri, 20 Mar 2015 14:58:39 GMT


Sean Owen commented on SPARK-5782:

Doesn't this make an RDD with tens of billions of elements in the values, and just 5 keys? It seems
like the problem is the massive size of the value for each key, which is what drives memory usage.

> Python Worker / Pyspark Daemon Memory Issue
> -------------------------------------------
>                 Key: SPARK-5782
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Shuffle
>    Affects Versions: 1.3.0, 1.2.1, 1.2.2
>         Environment: CentOS 7, Spark Standalone
>            Reporter: Mark Khaitman
>            Priority: Blocker
> I'm including the Shuffle component on this, as a brief scan through the code (which
> I'm not 100% familiar with just yet) shows a large amount of memory handling in it:
> It appears that any type of join between two RDDs spawns twice as many pyspark.daemon
> workers compared to the default 1 task -> 1 core configuration in our environment. This
> can become problematic in cases where you build up a tree of RDD joins, since the pyspark.daemons
> do not cease to exist until the top level join is completed (or so it seems)... This can lead
> to memory exhaustion by a single framework, even though it is set to have a 512MB Python worker
> memory limit and a few gigs of executor memory.
> A related issue is that the individual Python workers are not supposed to go far beyond
> 512MB at all; past that point they're supposed to spill to disk.
> Some of our Python workers are somehow reaching 2GB each (which is then multiplied by the
> number of cores per executor * the number of joins occurring in some cases), causing the Out-of-Memory
> killer to step up to its unfortunate job! :(
> I think with the _next_limit method in, if the current memory usage is close
> to the memory limit, then a 1.05 multiplier can endlessly cause more memory to be consumed
> by the single Python worker, since the max of (512 vs 511 * 1.05) would end up blowing up
> towards the latter of the two... Shouldn't the memory limit be the absolute cap in this case?
> I've only just started looking into the code, and would definitely love to contribute
> towards Spark, though I figured it might be quicker to resolve if someone already owns the
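
The `_next_limit` concern in the quoted description can be sketched in a few lines. This is a minimal illustration, not PySpark's actual code: the names `next_limit` and `simulate` and the `capped` flag are hypothetical, and the model assumes that spilling frees no memory and that the worker then grows up to each new threshold. Only the `max(limit, used * 1.05)` rule and the 512 / 511 figures come from the report.

```python
def next_limit(memory_limit_mb, used_memory_mb, capped=False):
    """Hypothetical stand-in for the threshold rule described in the report.

    Uncapped: the next spill threshold is max(limit, used * 1.05).
    Capped:   the configured limit is treated as an absolute ceiling,
              which is the behaviour the reporter asks about.
    """
    candidate = max(memory_limit_mb, used_memory_mb * 1.05)
    if capped:
        return min(candidate, memory_limit_mb)
    return candidate


def simulate(memory_limit_mb, start_used_mb, rounds, capped=False):
    """Return the sequence of spill thresholds over several rounds.

    Assumption (not from the source): a spill releases nothing, and the
    worker keeps allocating until it reaches the newly computed threshold.
    """
    used = start_used_mb
    limits = []
    for _ in range(rounds):
        limit = next_limit(memory_limit_mb, used, capped)
        limits.append(limit)
        used = limit  # worker fills up to the threshold before the next check
    return limits


# With 511 MB in use and a 512 MB limit, the uncapped threshold already
# exceeds the configured limit, and each further round raises it by 5%.
uncapped = simulate(512, 511, rounds=30)
capped = simulate(512, 511, rounds=30, capped=True)
```

Under these assumptions the uncapped thresholds grow geometrically, while the capped variant never moves past the configured 512 MB. Whether a hard cap is the right fix depends on why memory is not released after a spill, which this sketch does not model.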
