systemml-dev mailing list archives

From Matthias Boehm <mboe...@googlemail.com>
Subject Re: Parfor optimizer getting stuck
Date Thu, 07 Sep 2017 07:01:21 GMT
Thanks again for catching these issues, Rajarshi. I'd like to briefly
summarize their resolutions.

ad 1) Most likely this was caused by a configuration issue: specifically,
the default parallelism was set to less than the number of executors, so
the number of cores per executor came out as zero and the memory budget
per core as INF. SystemML master now has a patch that increases robustness
for such cases and improves the parfor debug output to log the relevant
Spark cluster configuration.
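
The arithmetic behind this can be illustrated with a minimal sketch. This is not SystemML's code: all function names below are hypothetical, the 20480 MB executor size is a made-up value, and the guard shown is only one possible way to make the computation robust. Only the 273068 MB estimate is taken from the quoted log.

```python
import math


def cores_per_executor(default_parallelism: int, num_executors: int) -> int:
    # Integer division: fewer total cores than executors truncates to zero.
    return default_parallelism // num_executors


def budget_per_core_mb(executor_mem_mb: float, cores: int) -> float:
    # Mimics IEEE-754 double division (as in Java), where x / 0.0 yields
    # Infinity instead of an error; Python would raise ZeroDivisionError.
    return executor_mem_mb / cores if cores > 0 else math.inf


def robust_cores_per_executor(default_parallelism: int, num_executors: int) -> int:
    # One possible guard (an assumption): never report fewer than one core.
    return max(1, default_parallelism // max(1, num_executors))


def fits(estimate_mb: float, budget_mb: float) -> bool:
    # Simple "fits in memory" test; any finite estimate fits an infinite budget.
    return estimate_mb <= budget_mb


# Hypothetical cluster: default parallelism 4, 8 executors of 20480 MB each.
cores = cores_per_executor(4, 8)
budget = budget_per_core_mb(20480, cores)
print(cores, budget, fits(273068, budget))
```

With an infinite budget, even the 273068 MB all-CP estimate from the quoted log passes the check, which matches the optimizer's choice of REMOTE_SPARK described in the quoted reply.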

ad 2) After an offline discussion with Arvind, we decided to remove this
brittle rewrite on update-in-place for parfor intermediates because we
didn't want to stall the 0.15 release and the asymptotic complexity of this
rewrite was at least quadratic in the number of nodes and candidates. The
performance impact is limited because we later introduced a generic
update-in-place rewrite for all for, parfor, and while loops. However, for
the 1.0 release we'll reimplement this more advanced update-in-place
rewrite from scratch.

Regards,
Matthias

On Mon, Sep 4, 2017 at 12:46 AM, Matthias Boehm1 <Matthias.Boehm1@ibm.com>
wrote:

> Thanks for sharing, Rajarshi - well, this is definitely a bug and needs
> fixing before our 0.15 release.
>
> Looking over the output, there are really two different issues here:
>
> 1) The remote memory is Infinity, which causes the optimizer to go for
> REMOTE_SPARK despite the unknown sizes (and max memory estimates) and
> smaller degree of parallelism. This is because it mistakenly assumes all
> operations could be compiled to CP as they would "fit" into remote executor
> memory. We need to (a) make this decision more resilient and (b) find the
> root cause of the messed up memory budget.
>
> 2) Looking over the trace of the parfor optimizer, it seems to get stuck
> in "rewriteSetInPlaceResultIndexing". @Arvind: I remember that we had a
> similar issue when we introduced/extended this rewrite more than a year
> ago. I'll have a look into this tomorrow unless you want to handle it.
>
> Regards,
> Matthias
>
> From: Rajarshi Bhadra <bhadrarajarshi9@gmail.com>
> To: dev@systemml.apache.org
> Cc: Matthias Boehm1 <matthias.boehm1@ibm.com>
> Date: 09/03/2017 11:52 PM
> Subject: Parfor optimizer getting stuck
> ------------------------------
>
>
>
> Hi,
>
> I am working on a custom tree-based algorithm which I am trying to develop
> using SystemML. I am using version 1.0.0-SNAPSHOT. My issue is that the
> parfor statement gets stuck somewhere without any error report or output,
> so I am unable to determine what the issue might be. However, I have been
> able to get a parfor log, which is as follows:
>
>
> 17/09/04 06:59:12 DEBUG Optimizer: --- RULEBASED OPTIMIZER -------
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: Optimize w/ max_mem=63716MB/InfinityMB/InfinityMB, max_k=32/1/1).
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: estimated mem (serial exec) M=52MB
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: rewrite 'set data partitioner' - result=NONE ()
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: rewrite 'remove unnecessary compare matrix' - result=false ()
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: rewrite 'set result partitioning' - result=false
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: estimated new mem (serial exec) M=52MB
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: estimated new mem (serial exec, all CP) M=273068MB
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: estimated new mem (cond partitioning) M=52MB
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: rewrite 'set execution strategy' - result=REMOTE_SPARK (recompile=true)
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: rewrite 'set operation exec type CP' - result=198
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: rewrite 'enable data colocation' - result=false
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: rewrite 'set partition replication factor' - result=false
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: rewrite 'set export replication factor' - result=true (3)
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: rewrite 'enable nested parallelism' - result=false
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: rewrite 'set degree of parallelism' - result=(see EXPLAIN)
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: rewrite 'set task partitioner' - result=NAIVE
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: rewrite 'set fused data partitioning and execution' - result=false
> 17/09/04 06:59:12 DEBUG Optimizer: RULEBASED OPT: rewrite 'set transpose sparse vector operations' - result=false
>
> It would be great if someone could help me out with this issue.
>
> Thank you
>
>
>
