spark-issues mailing list archives

From "Mark Khaitman (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-5782) Python Worker / Pyspark Daemon Memory Issue
Date Thu, 12 Feb 2015 19:52:12 GMT
Mark Khaitman created SPARK-5782:
------------------------------------

             Summary: Python Worker / Pyspark Daemon Memory Issue
                 Key: SPARK-5782
                 URL: https://issues.apache.org/jira/browse/SPARK-5782
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Shuffle
    Affects Versions: 1.2.1, 1.3.0, 1.2.2
         Environment: CentOS 7, Spark Standalone
            Reporter: Mark Khaitman


I'm including the Shuffle component on this, as a brief scan through the code (which I'm not
100% familiar with just yet) shows that a large amount of the memory handling lives there.

It appears that any type of join between two RDDs spawns twice as many pyspark.daemon workers
as the default 1 task -> 1 core configuration in our environment would suggest. This becomes
problematic when you build up a tree of RDD joins, since the pyspark.daemon processes do not
go away until the top-level join is completed (or so it seems)... This can lead to memory
exhaustion by a single framework, even though it is set to have a 512MB python worker memory
limit and a few GB of executor memory.
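For illustration, here is a rough sketch of the join-tree shape I mean (the RDD contents and
app name are made up; this is not our actual job):

# Hypothetical reproducer: a small tree of pair-RDD joins, roughly the
# shape that spawns the extra pyspark.daemon workers for us.
from pyspark import SparkContext

sc = SparkContext(appName="join-tree-sketch")

a = sc.parallelize([(i, i) for i in range(1000)])
b = sc.parallelize([(i, i * 2) for i in range(1000)])
c = sc.parallelize([(i, i * 3) for i in range(1000)])
d = sc.parallelize([(i, i * 4) for i in range(1000)])

# Two leaf-level joins, then a top-level join over their results. The
# python workers backing the leaf joins seem to stick around until the
# top-level join (and its action) has completed.
ab = a.join(b)
cd = c.join(d)
print(ab.join(cd).count())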

Another related issue is that the individual python workers are not supposed to exceed 512MB
by much in the first place; once they hit that limit they are supposed to spill to disk.
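For reference, these are the settings I'm referring to (the exact values below are just
placeholders matching the rough numbers above, not recommendations):

from pyspark import SparkConf, SparkContext

# spark.python.worker.memory caps each python worker during aggregation;
# beyond it the worker is expected to spill to disk.
conf = (SparkConf()
        .set("spark.python.worker.memory", "512m")
        .set("spark.executor.memory", "4g"))

sc = SparkContext(conf=conf)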

I came across this bit of code in shuffle.py which *may* have something to do with allowing
some of our python workers to somehow reach 2GB each. Multiplied by the number of cores per
executor, and by the number of joins occurring in some cases, that is enough to make the
Out-of-Memory killer step up to its unfortunate job! :(

def _next_limit(self):
    """
    Return the next memory limit. If the memory is not released
    after spilling, it will dump the data only when the used memory
    starts to increase.
    """
    return max(self.memory_limit, get_used_memory() * 1.05)

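To show why that formula worries me, here is a toy illustration (the numbers are invented,
not measurements): if the worker's RSS never drops after a spill, each call to _next_limit()
raises the limit by ~5%, so after a few dozen spills the 512MB limit can ratchet past 2GB.

# Toy model of the ratcheting: assume the process never returns memory
# to the OS after spilling, so used memory always equals the last limit.
memory_limit = 512  # MB, i.e. spark.python.worker.memory

def next_limit(used_memory):
    # same formula as _next_limit() in shuffle.py
    return max(memory_limit, used_memory * 1.05)

used = memory_limit
for spill in range(1, 30):
    used = next_limit(used)   # limit grows ~5% per spill
    print("after spill %2d: limit ~= %4d MB" % (spill, used))
# By roughly the 29th spill the limit has crept past 2GB.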

I've only just started looking into the code, and would definitely love to contribute towards
Spark, though I figured it might be quicker to resolve if someone already owns the code!


