spark-user mailing list archives

From "Carlile, Ken" <>
Subject Limit pyspark.daemon threads
Date Thu, 17 Mar 2016 14:43:49 GMT

We have an HPC cluster that we run Spark jobs on using standalone mode and a number of scripts
I’ve built up to dynamically schedule and start spark clusters within the Grid Engine framework.
Nodes in the cluster have 16 cores and 128GB of RAM. 

My users use pyspark heavily. We’ve been having a number of problems with nodes going offline
under extraordinarily high load. I was able to look at one of those nodes today before it went
truly sideways, and I discovered that the user was running 50 pyspark.daemon threads (remember,
this is a 16-core box); the load was around 25, with all CPUs pegged at 100%.

So while the spark worker is aware it’s only got 16 cores and behaves accordingly, pyspark
seems to be happy to overrun everything like crazy. Is there a global parameter I can use
to limit pyspark threads to a sane number, say 15 or 16? It would also be interesting to set
a memory limit, which leads to another question. 
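For reference, the only knobs I’ve turned up so far are the standard core limits, sketched below. As far as I can tell these bound concurrent tasks rather than pyspark.daemon workers directly, so I’m not sure they actually help here (my_job.py is just a placeholder for the user’s script):

```shell
# In conf/spark-env.sh on each node (standalone mode):
# caps the cores this worker advertises to the master.
SPARK_WORKER_CORES=15

# Or per application at submit time: spark.cores.max bounds the total
# cores the app may claim across the cluster; spark.task.cpus is the
# number of cores reserved per task.
spark-submit \
  --conf spark.cores.max=15 \
  --conf spark.task.cpus=1 \
  my_job.py
```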

How is memory managed when pyspark is used? I have the spark worker memory set to 90GB, and
there is 8GB of system overhead (GPFS caching), so if pyspark operates outside of the JVM
memory pool, that leaves it at most 30GB to play with, assuming there is no overhead outside
the JVM’s 90GB heap (ha ha.)
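Spelling out my arithmetic, as a back-of-the-envelope sketch (the 8GB GPFS figure is our site’s number, and it optimistically assumes the python workers live entirely outside the 90GB JVM heap with zero off-heap JVM overhead):

```python
# Back-of-the-envelope memory budget for one 128GB node.
total_ram_gb = 128
jvm_heap_gb = 90        # spark worker memory (JVM heap)
system_overhead_gb = 8  # GPFS caching

# What's left for pyspark.daemon workers outside the JVM:
pyspark_budget_gb = total_ram_gb - jvm_heap_gb - system_overhead_gb
print(pyspark_budget_gb)  # → 30
```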

Ken Carlile
Sr. Unix Engineer
HHMI/Janelia Research Campus
