systemml-dev mailing list archives

From Matthias Boehm <mboe...@googlemail.com>
Subject Re: Unexpected Executor Crash
Date Fri, 16 Jun 2017 19:15:51 GMT
this might indeed be a robustness issue of rmm, which is a replication-based
matrix multiply operator. I'll have a look. In the meantime, you can
increase your driver memory (you currently run w/ a 1GB driver, resulting in
a 700MB local memory budget) to something like 10GB. This would allow a
broadcast-based matrix multiply operator (the broadcast creation
requires twice the memory of a matrix, in your case 2.8GB).
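For intuition, the arithmetic behind these numbers can be sketched as
follows (the 70% budget fraction and the dense double-precision sizing are
assumptions inferred from the figures in this thread, not values checked
against SystemML's source):

```python
GB = 1e9  # decimal GB, matching the figures quoted in this thread

def dense_matrix_bytes(rows, cols):
    # Dense double-precision matrix: 8 bytes per cell.
    return rows * cols * 8

def local_budget(driver_heap_gb, fraction=0.7):
    # Assumption: SystemML's local memory budget is ~70% of the driver heap
    # (consistent with 1GB driver -> 700MB budget quoted above).
    return driver_heap_gb * GB * fraction

m = dense_matrix_bytes(18750, 18750)  # ~2.8 GB, the matrix in this thread
need = 2 * m                          # broadcast creation needs ~2x the matrix

print(local_budget(1) >= need)   # 1GB driver -> 700MB budget, far too small
print(local_budget(10) >= need)  # 10GB driver -> ~7GB budget, fits
```

This is why a 10GB driver is enough for the broadcast-based operator while
the 1GB driver forces the replication-based rmm.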

Regards,
Matthias


On Fri, Jun 16, 2017 at 11:52 AM, Anthony Thomas <ahthomas@eng.ucsd.edu>
wrote:

> Hi Matthias and Glenn,
>
> Unfortunately I'm still running into this problem with executors crashing
> due to OOM. Here's the runtime plan generated by SystemML:
>
> 17/06/16 18:36:03 INFO DMLScript: EXPLAIN (RUNTIME):
>
> # Memory Budget local/remote = 628MB/42600MB/63900MB/3195MB
>
> # Degree of Parallelism (vcores) local/remote = 8/24
>
> PROGRAM ( size CP/SP = 17/4 )
>
> --MAIN PROGRAM
>
> ----GENERIC (lines 1-4) [recompile=true]
>
> ------CP createvar pREADM /scratch/M5.csv false MATRIX csv 18750 18750 -1
> -1 351562500 copy false , true 0.0
>
> ------CP createvar _mVar1 scratch_space//_p15260_172.31.3.116//_t0/temp0
> true MATRIX binaryblock 18750 18750 1000 1000 351562500 copy
>
> ------SPARK csvrblk pREADM.MATRIX.DOUBLE _mVar1.MATRIX.DOUBLE 1000 1000
> false , true 0.0
>
> ------CP createvar _mVar2 scratch_space//_p15260_172.31.3.116//_t0/temp1
> true MATRIX binaryblock 18750 18750 1000 1000 351562500 copy
>
> ------SPARK chkpoint _mVar1.MATRIX.DOUBLE _mVar2.MATRIX.DOUBLE
> MEMORY_AND_DISK
>
> ------CP rmvar _mVar1
>
> ------CP createvar _mVar3 scratch_space//_p15260_172.31.3.116//_t0/temp2
> true MATRIX binaryblock 18750 18750 1000 1000 -1 copy
>
> ------SPARK rmm _mVar2.MATRIX.DOUBLE _mVar2.MATRIX.DOUBLE
> _mVar3.MATRIX.DOUBLE
>
> ------CP rmvar _mVar2
>
> ------CP cpvar _mVar3 MN
>
> ------CP rmvar _mVar3
>
> ----GENERIC (lines 7-7) [recompile=true]
>
> ------CP createvar _mVar4 scratch_space//_p15260_172.31.3.116//_t0/temp3
> true MATRIX binaryblock 1 1 1000 1000 -1 copy
>
> ------SPARK rangeReIndex MN.MATRIX.DOUBLE 1.SCALAR.INT.true
> 1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar4.MATRIX.DOUBLE
> NONE
>
> ------CP castdts _mVar4.MATRIX.DOUBLE.false _Var5.SCALAR.STRING
>
> ------CP rmvar _mVar4
>
> ------CP print _Var5.SCALAR.STRING.false _Var6.SCALAR.STRING
>
> ------CP rmvar _Var5
>
> ------CP rmvar _Var6
>
> ------CP rmvar MN
>
> ----GENERIC (lines 9-9) [recompile=false]
>
> ------CP print DONE!.SCALAR.STRING.true _Var7.SCALAR.STRING
>
> ------CP rmvar _Var7
>
>
> The actual error reported by the executor is:
>
>
> # There is insufficient memory for the Java Runtime Environment to
> continue.
> # Native memory allocation (mmap) failed to map 481296384 bytes for
> committing reserved memory.
>
>
> I can send my Spark and YARN configurations as well if that would be
> useful. Thanks a lot for your help.
>
>
> Best,
>
>
> Anthony
>
> On Thu, Jun 15, 2017 at 3:00 PM, Anthony Thomas <ahthomas@eng.ucsd.edu>
> wrote:
>
> > Thanks Matthias and Glenn,
> >
> > I'll give these suggestions a try once I get back in the office tomorrow.
> >
> > Best,
> >
> > Anthony
> >
> >
> > On Jun 15, 2017 12:36 PM, "Matthias Boehm" <mboehm7@googlemail.com>
> wrote:
> >
> > well, I think Anthony already used -exec spark here; I would recommend
> > (1) fixing the driver configuration via --driver-java-options "-Xmn2500m"
> > (we assume that the young generation does not exceed 10% of the max heap
> > configuration) - this will help if the OOM comes from the driver - and
> > (2) potentially increasing the memory overhead of the executors (--conf
> > spark.yarn.executor.memoryOverhead=10240) if running on YARN and the node
> > manager kills the executor processes because they exceed the container
> > limits. If this does not help, please provide the -explain output and we
> > will have a closer look.
> >
> > Regards,
> > Matthias
> >
> > On Thu, Jun 15, 2017 at 10:15 AM, Glenn Weidner <gweidner@us.ibm.com>
> > wrote:
> >
> > > Hi Anthony,
> > >
> > > Could you retry your scenario without the '-exec spark' option? By
> > > default, SystemML will run in hybrid_spark mode, which is more
> > > efficient.
> > >
> > > Thanks,
> > > Glenn
> > >
> > >
> > >
> > > From: Anthony Thomas <ahthomas@eng.ucsd.edu>
> > > To: dev@systemml.apache.org
> > > Date: 06/15/2017 09:50 AM
> > > Subject: Unexpected Executor Crash
> > > ------------------------------
> > >
> > >
> > >
> > > Hi SystemML Developers,
> > >
> > > I'm running the following simple DML script under SystemML 0.14:
> > >
> > > M = read('/scratch/M5.csv')
> > > N = read('/scratch/M5.csv')
> > > MN = M %*% N
> > > if (1 == 1) {
> > >    print(as.scalar(MN[1,1]))
> > > }
> > >
> > > The matrix M is square and about 5GB on disk (stored in HDFS). I am
> > > submitting the script to a 2 node spark cluster where each physical
> > machine
> > > has 30GB of RAM. I am using the following command to submit the job:
> > >
> > > $SPARK_HOME/bin/spark-submit --driver-memory=5G --executor-memory=25G
> > > --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128
> > > --verbose --conf
> > > spark.serializer=org.apache.spark.serializer.KryoSerializer
> > > $SYSTEMML_HOME/SystemML.jar -f example.dml -exec spark -explain runtime
> > >
> > > However, I consistently run into errors like:
> > >
> > > ERROR TaskSchedulerImpl: Lost executor 1 on 172.31.3.116: Remote RPC
> > > client
> > > disassociated. Likely due to containers exceeding thresholds, or
> network
> > > issues. Check driver logs for WARN messages.
> > >
> > > and the job eventually aborts. Consulting the output of executors shows
> > > they are crashing with OutOfMemory exceptions. Even if one executor
> > needed
> > > to store M,N and MN in memory simultaneously it seems like there should
> > be
> > > enough memory so I'm unsure why the executor is crashing. In addition,
> I
> > > was under the impression that Spark would spill to disk if there was
> > > insufficient memory. I've tried various combinations of
> > > increasing/decreasing the number of executor cores (from 1 to 8), using
> > > more/fewer executors, increasing/decreasing Spark's memoryFraction, and
> > > increasing/decreasing Spark's default parallelism all without success.
> > Can
> > > anyone offer any advice or suggestions to debug this issue further? I'm
> > not
> > > a very experienced Spark user so it's very possible I haven't
> configured
> > > something correctly. Please let me know if you'd like any further
> > > information.
> > >
> > > Best,
> > >
> > > Anthony Thomas
> > >
> > >
> > >
> > >
> >
> >
> >
>
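Taken together, the advice in this thread amounts to a submission along
these lines (a sketch only: the driver-memory, -Xmn, and memoryOverhead
values are the ones suggested above, and paths and flags should be adapted
to the actual cluster):

```shell
# Sketch combining the suggestions in this thread (not verified settings).
# -exec spark is dropped so SystemML defaults to hybrid_spark mode, as
# Glenn suggests above.
$SPARK_HOME/bin/spark-submit \
  --driver-memory 10G \
  --executor-memory 25G \
  --driver-java-options "-Xmn2500m" \
  --conf spark.yarn.executor.memoryOverhead=10240 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  $SYSTEMML_HOME/SystemML.jar -f example.dml -explain runtime
```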
