systemml-dev mailing list archives

From Matthias Boehm <mboe...@googlemail.com>
Subject Re: Unexpected Executor Crash
Date Sun, 18 Jun 2017 00:47:03 GMT
Thank you both for the feedback - this helps because large square matrix
multiply has not been a common workload for us so far. I now have a fix for
this issue, which should be in master tomorrow.

The major improvements are (1) shallow block replicas, (2) adjustments of
the number of partitions according to estimated intermediate size, and (3)
a better hash function to avoid load imbalance. Together these changes
bound memory requirements per partition, ensure better parallelism, and
improve performance by more than 3x (2x end-to-end, including CSV read,
Spark context creation, and collect).
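
As a rough illustration of why the number of partitions needs to follow the
estimated intermediate size, the following Python sketch estimates the
replicated data volume for the 18750 x 18750 input with 1000 x 1000 blocks
from the runtime plan quoted below. It is not SystemML's implementation: the
replication model (each input block copied once per block row or block
column of the other operand), the 128MB target partition size, and all
function names are assumptions made for illustration only.

```python
import math


def dense_size_bytes(rows, cols, bytes_per_cell=8):
    """Size of a dense matrix of doubles, ignoring any block headers."""
    return rows * cols * bytes_per_cell


def num_blocks(dim, block_size=1000):
    """Number of blocks along one dimension (the last block may be partial)."""
    return math.ceil(dim / block_size)


def rmm_intermediate_bytes(m, k, n, block_size=1000):
    """Assumed replication model for A (m x k) times B (k x n):
    A is copied once per block column of B, B once per block row of A."""
    a_replicated = dense_size_bytes(m, k) * num_blocks(n, block_size)
    b_replicated = dense_size_bytes(k, n) * num_blocks(m, block_size)
    return a_replicated + b_replicated


def num_partitions(intermediate_bytes, target_partition_bytes=128 * 1024**2):
    """Pick enough partitions that each holds at most the (assumed) target size."""
    return max(1, math.ceil(intermediate_bytes / target_partition_bytes))


if __name__ == "__main__":
    dim = 18750  # matrix dimension from the runtime plan in this thread
    intermediate = rmm_intermediate_bytes(dim, dim, dim)
    print("dense input bytes:", dense_size_bytes(dim, dim))
    print("estimated replicated bytes:", intermediate)
    print("suggested partitions:", num_partitions(intermediate))
```

Under these assumptions the replicated data is far larger than the 2.8GB
input, so a partition count derived from the input size alone would leave
each partition with much more data than intended.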

Regards,
Matthias
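
The driver-memory advice in the quoted reply below can be checked with
simple arithmetic. This Python sketch assumes the local memory budget is 70%
of the driver heap (inferred from the "1GB driver, 700MB budget" figure in
that reply) and that creating a broadcast needs twice the dense matrix size.
The function names are hypothetical, and this is not SystemML's actual
operator selection logic.

```python
def local_budget_bytes(driver_heap_bytes, budget_fraction=0.7):
    """Assumed local memory budget: a fixed fraction of the driver heap."""
    return driver_heap_bytes * budget_fraction


def broadcast_fits(rows, cols, driver_heap_bytes, bytes_per_cell=8):
    """True if twice the dense matrix size fits into the local budget."""
    matrix_bytes = rows * cols * bytes_per_cell
    return 2 * matrix_bytes <= local_budget_bytes(driver_heap_bytes)


if __name__ == "__main__":
    GB = 10**9
    # 18750 x 18750 doubles are about 2.8GB, so a broadcast needs about 5.6GB.
    print("1GB driver:", broadcast_fits(18750, 18750, 1 * GB))
    print("10GB driver:", broadcast_fits(18750, 18750, 10 * GB))
```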


On Sat, Jun 17, 2017 at 1:40 AM, arijit chakraborty <akc14@hotmail.com>
wrote:

> Hi Matthias,
>
>
> We are facing an issue similar to Anthony's. I was not sure whether I
> should open a new issue, so I am adding my concern here.
>
>
> We set the executor memory to "15GB" but do not set any driver memory.
>
>
> Thank you!
>
> Arijit
>
> ________________________________
> From: Matthias Boehm <mboehm7@googlemail.com>
> Sent: Saturday, June 17, 2017 12:45:51 AM
> To: dev@systemml.apache.org
> Subject: Re: Unexpected Executor Crash
>
> this might indeed be a robustness issue of rmm, which is a replication-
> based matrix multiply operator. I'll have a look. In the meantime, you can
> increase your driver memory (you currently run w/ 1GB driver, resulting in
> 700MB local memory budget) to something like 10GB. This would allow a
> broadcast-based matrix multiply operator (the broadcast creation requires
> twice the memory of the matrix, which is 2.8GB in your case).
>
> Regards,
> Matthias
>
>
> On Fri, Jun 16, 2017 at 11:52 AM, Anthony Thomas <ahthomas@eng.ucsd.edu>
> wrote:
>
> > Hi Matthias and Glenn,
> >
> > Unfortunately I'm still running into this problem with executors crashing
> > due to OOM. Here's the runtime plan generated by SystemML:
> >
> > 17/06/16 18:36:03 INFO DMLScript: EXPLAIN (RUNTIME):
> >
> > # Memory Budget local/remote = 628MB/42600MB/63900MB/3195MB
> >
> > # Degree of Parallelism (vcores) local/remote = 8/24
> >
> > PROGRAM ( size CP/SP = 17/4 )
> >
> > --MAIN PROGRAM
> >
> > ----GENERIC (lines 1-4) [recompile=true]
> >
> > ------CP createvar pREADM /scratch/M5.csv false MATRIX csv 18750 18750 -1
> > -1 351562500 copy false , true 0.0
> >
> > ------CP createvar _mVar1 scratch_space//_p15260_172.31.3.116//_t0/temp0
> > true MATRIX binaryblock 18750 18750 1000 1000 351562500 copy
> >
> > ------SPARK csvrblk pREADM.MATRIX.DOUBLE _mVar1.MATRIX.DOUBLE 1000 1000
> > false , true 0.0
> >
> > ------CP createvar _mVar2 scratch_space//_p15260_172.31.3.116//_t0/temp1
> > true MATRIX binaryblock 18750 18750 1000 1000 351562500 copy
> >
> > ------SPARK chkpoint _mVar1.MATRIX.DOUBLE _mVar2.MATRIX.DOUBLE
> > MEMORY_AND_DISK
> >
> > ------CP rmvar _mVar1
> >
> > ------CP createvar _mVar3 scratch_space//_p15260_172.31.3.116//_t0/temp2
> > true MATRIX binaryblock 18750 18750 1000 1000 -1 copy
> >
> > ------SPARK rmm _mVar2.MATRIX.DOUBLE _mVar2.MATRIX.DOUBLE
> > _mVar3.MATRIX.DOUBLE
> >
> > ------CP rmvar _mVar2
> >
> > ------CP cpvar _mVar3 MN
> >
> > ------CP rmvar _mVar3
> >
> > ----GENERIC (lines 7-7) [recompile=true]
> >
> > ------CP createvar _mVar4 scratch_space//_p15260_172.31.3.116//_t0/temp3
> > true MATRIX binaryblock 1 1 1000 1000 -1 copy
> >
> > ------SPARK rangeReIndex MN.MATRIX.DOUBLE 1.SCALAR.INT.true
> > 1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true
> > _mVar4.MATRIX.DOUBLE
> > NONE
> >
> > ------CP castdts _mVar4.MATRIX.DOUBLE.false _Var5.SCALAR.STRING
> >
> > ------CP rmvar _mVar4
> >
> > ------CP print _Var5.SCALAR.STRING.false _Var6.SCALAR.STRING
> >
> > ------CP rmvar _Var5
> >
> > ------CP rmvar _Var6
> >
> > ------CP rmvar MN
> >
> > ----GENERIC (lines 9-9) [recompile=false]
> >
> > ------CP print DONE!.SCALAR.STRING.true _Var7.SCALAR.STRING
> >
> > ------CP rmvar _Var7
> >
> >
> > The actual error reported by the executor is:
> >
> >
> > # There is insufficient memory for the Java Runtime Environment to
> > continue.
> > # Native memory allocation (mmap) failed to map 481296384 bytes for
> > committing reserved memory.
> >
> >
> > I can send my Spark and YARN configurations as well if that would be
> > useful. Thanks a lot for your help.
> >
> >
> > Best,
> >
> >
> > Anthony
> >
> > On Thu, Jun 15, 2017 at 3:00 PM, Anthony Thomas <ahthomas@eng.ucsd.edu>
> > wrote:
> >
> > > Thanks Matthias and Glenn,
> > >
> > > I'll give these suggestions a try once I get back in the office
> > > tomorrow.
> > >
> > > Best,
> > >
> > > Anthony
> > >
> > >
> > > On Jun 15, 2017 12:36 PM, "Matthias Boehm" <mboehm7@googlemail.com>
> > > wrote:
> > >
> > > well, I think Anthony already used -exec spark here; I would recommend to
> > > (1) fix the driver configuration via --driver-java-options "-Xmn2500m" (we
> > > assume that the young generation does not exceed 10% of the max heap
> > > configuration) - this will help if the OOM comes from the driver, and (2)
> > > potentially increase the memory overhead of the executors (--conf
> > > spark.yarn.executor.memoryOverhead=10240) if run on YARN and the node
> > > manager kills the executor processes because they exceed the container
> > > limits. If this does not help, please provide the -explain output and we
> > > will have a closer look.
> > >
> > > Regards,
> > > Matthias
> > >
> > > On Thu, Jun 15, 2017 at 10:15 AM, Glenn Weidner <gweidner@us.ibm.com>
> > > wrote:
> > >
> > > > Hi Anthony,
> > > >
> > > > > Could you retry your scenario without the '-exec spark' option? By
> > > > > default, SystemML will run in hybrid_spark mode which is more
> > > > > efficient.
> > > >
> > > > Thanks,
> > > > Glenn
> > > >
> > > >
> > > >
> > > > From: Anthony Thomas <ahthomas@eng.ucsd.edu>
> > > > To: dev@systemml.apache.org
> > > > Date: 06/15/2017 09:50 AM
> > > > Subject: Unexpected Executor Crash
> > > > ------------------------------
> > > >
> > > >
> > > >
> > > > Hi SystemML Developers,
> > > >
> > > > I'm running the following simple DML script under SystemML 0.14:
> > > >
> > > > M = read('/scratch/M5.csv')
> > > > N = read('/scratch/M5.csv')
> > > > MN = M %*% N
> > > > if (1 == 1) {
> > > >    print(as.scalar(MN[1,1]))
> > > > }
> > > >
> > > > > The matrix M is square and about 5GB on disk (stored in HDFS). I am
> > > > > submitting the script to a 2 node spark cluster where each physical
> > > > > machine has 30GB of RAM. I am using the following command to submit
> > > > > the job:
> > > >
> > > > $SPARK_HOME/bin/spark-submit --driver-memory=5G --executor-memory=25G
> > > > --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128
> > > > --verbose --conf
> > > > spark.serializer=org.apache.spark.serializer.KryoSerializer
> > > > > $SYSTEMML_HOME/SystemML.jar -f example.dml -exec spark -explain
> > > > > runtime
> > > >
> > > > However, I consistently run into errors like:
> > > >
> > > > > ERROR TaskSchedulerImpl: Lost executor 1 on 172.31.3.116: Remote RPC
> > > > > client disassociated. Likely due to containers exceeding thresholds,
> > > > > or network issues. Check driver logs for WARN messages.
> > > >
> > > > > and the job eventually aborts. Consulting the output of executors
> > > > > shows they are crashing with OutOfMemory exceptions. Even if one
> > > > > executor needed to store M, N and MN in memory simultaneously it
> > > > > seems like there should be enough memory so I'm unsure why the
> > > > > executor is crashing. In addition, I was under the impression that
> > > > > Spark would spill to disk if there was insufficient memory. I've
> > > > > tried various combinations of increasing/decreasing the number of
> > > > > executor cores (from 1 to 8), using more/fewer executors,
> > > > > increasing/decreasing Spark's memoryFraction, and
> > > > > increasing/decreasing Spark's default parallelism all without
> > > > > success. Can anyone offer any advice or suggestions to debug this
> > > > > issue further? I'm not a very experienced Spark user so it's very
> > > > > possible I haven't configured something correctly. Please let me
> > > > > know if you'd like any further information.
> > > >
> > > > Best,
> > > >
> > > > Anthony Thomas
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
>
