systemml-dev mailing list archives

From Matthias Boehm <mboe...@googlemail.com>
Subject Re: Unexpected Executor Crash
Date Thu, 15 Jun 2017 19:36:00 GMT
Well, I think Anthony already used -exec spark here. I would recommend that
you (1) fix the driver configuration via --driver-java-options "-Xmn2500m"
(we assume that the young generation does not exceed 10% of the max heap
configuration), which will help if the OOM comes from the driver, and (2)
potentially increase the memory overhead of the executors (--conf
spark.yarn.executor.memoryOverhead=10240) if you are running on YARN and the
node manager kills the executor processes because they exceed the container
limits. If this does not help, please provide the -explain output and we
will have a closer look.
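
For reference, the two suggestions above could be combined with the
original submit command roughly as sketched below. This is a dry run that
only prints the command. The variable names (YOUNG_GEN, OVERHEAD_MB, CMD)
are illustrative and not part of SystemML or Spark; all flag values are
copied unchanged from this thread.

```shell
#!/bin/sh
# Dry run: print the adjusted spark-submit command instead of executing it.
# $SPARK_HOME and $SYSTEMML_HOME are kept literal (single quotes) so the
# script also runs on a machine where they are not set.

YOUNG_GEN="-Xmn2500m"   # (1) young-generation size, value taken from the reply
OVERHEAD_MB=10240       # (2) executor memory overhead in MB, value from the reply

CMD='$SPARK_HOME/bin/spark-submit --driver-memory=5G --executor-memory=25G'
CMD="$CMD --driver-java-options \"$YOUNG_GEN\""
CMD="$CMD --conf spark.yarn.executor.memoryOverhead=$OVERHEAD_MB"
CMD="$CMD --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128"
CMD="$CMD --verbose --conf spark.serializer=org.apache.spark.serializer.KryoSerializer"
CMD="$CMD"' $SYSTEMML_HOME/SystemML.jar -f example.dml -exec spark -explain runtime'

echo "$CMD"
```

Note that on YARN the container request is the executor heap plus the
memory overhead, so on 30GB nodes the --executor-memory value would
probably have to be lowered for the request to fit. The right split is not
stated in this thread.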

Regards,
Matthias

On Thu, Jun 15, 2017 at 10:15 AM, Glenn Weidner <gweidner@us.ibm.com> wrote:

> Hi Anthony,
>
> Could you retry your scenario without the '-exec spark' option? By
> default, SystemML will run in hybrid_spark mode which is more efficient.
>
> Thanks,
> Glenn
>
>
> From: Anthony Thomas <ahthomas@eng.ucsd.edu>
> To: dev@systemml.apache.org
> Date: 06/15/2017 09:50 AM
> Subject: Unexpected Executor Crash
> ------------------------------
>
>
>
> Hi SystemML Developers,
>
> I'm running the following simple DML script under SystemML 0.14:
>
> M = read('/scratch/M5.csv')
> N = read('/scratch/M5.csv')
> MN = M %*% N
> if (1 == 1) {
>    print(as.scalar(MN[1,1]))
> }
>
> The matrix M is square and about 5GB on disk (stored in HDFS). I am
> submitting the script to a 2 node spark cluster where each physical machine
> has 30GB of RAM. I am using the following command to submit the job:
>
> $SPARK_HOME/bin/spark-submit --driver-memory=5G --executor-memory=25G
> --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128
> --verbose --conf
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> $SYSTEMML_HOME/SystemML.jar -f example.dml -exec spark -explain runtime
>
> However, I consistently run into errors like:
>
> ERROR TaskSchedulerImpl: Lost executor 1 on 172.31.3.116: Remote RPC client
> disassociated. Likely due to containers exceeding thresholds, or network
> issues. Check driver logs for WARN messages.
>
> and the job eventually aborts. Consulting the output of executors shows
> they are crashing with OutOfMemory exceptions. Even if one executor needed
> to store M, N, and MN in memory simultaneously, it seems like there should be
> enough memory so I'm unsure why the executor is crashing. In addition, I
> was under the impression that Spark would spill to disk if there was
> insufficient memory. I've tried various combinations of
> increasing/decreasing the number of executor cores (from 1 to 8), using
> more/fewer executors, increasing/decreasing Spark's memoryFraction, and
> increasing/decreasing Spark's default parallelism all without success. Can
> anyone offer any advice or suggestions to debug this issue further? I'm not
> a very experienced Spark user so it's very possible I haven't configured
> something correctly. Please let me know if you'd like any further
> information.
>
> Best,
>
> Anthony Thomas
>
>
>
>
