systemml-issues mailing list archives

From "Matthias Boehm (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SYSTEMML-1774) Improve Parfor parallelism for deep learning
Date Tue, 18 Jul 2017 01:54:00 GMT

    [ https://issues.apache.org/jira/browse/SYSTEMML-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090967#comment-16090967 ]

Matthias Boehm commented on SYSTEMML-1774:
------------------------------------------

Well, of course I'm happy to help here, but let's separate the individual issues first.

1) NPE in ConvolutionCPInstruction: [~niketanpansare] could you please have a look into this
issue? The compiled -1 parameter is a bit suspicious; in any case, it should not throw a
NullPointerException. Also, why is there a ConvolutionUtils.scalarOperations? These convolution
operations should call the existing scalar operations.

2) Parfor REMOTE_SPARK: Just to be clear, running in spark execution mode while forcing REMOTE_SPARK
is an invalid configuration. We have mechanisms to force the recompilation to CP for all instructions
in the parfor body, but this does not apply to conflicting configurations.

The real issue here is the need to force spark and/or remote_spark at all. No library of dml
scripts should force REMOTE_SPARK (other than for testing) because it can create many issues
such as unnecessary OOMs. If there are limitations of size propagation which prevent us from
compiling this automatically when beneficial, we should fix the underlying root cause. [~Tenma]
and [~dusenberrymw], could you please provide the configuration of a scenario where REMOTE_SPARK
was beneficial but not automatically chosen, and I'll take care of it.
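For illustration, the distinction between a forced and an optimizer-chosen parfor looks roughly like this in DML (a sketch only; the loop body is elided, and the variable names mirror the script from the issue description):

```dml
# Discouraged outside of testing: hard-coding distributed execution can
# conflict with the driver's execution mode and cause unnecessary OOMs.
parfor (j in 1:parallel_batches, mode=REMOTE_SPARK, opt=CONSTRAINED) {
  # ... mini-batch gradient computation ...
}

# Preferred: leave mode selection to the parfor optimizer, which chooses
# local or distributed execution based on size propagation and memory
# estimates, falling back to CP where appropriate.
parfor (j in 1:parallel_batches) {
  # ... mini-batch gradient computation ...
}
```

The point of the comment above is that if the optimizer fails to pick REMOTE_SPARK in a case where it would be beneficial, the fix belongs in the optimizer (e.g., its size propagation), not in the script.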


> Improve Parfor parallelism for deep learning
> --------------------------------------------
>
>                 Key: SYSTEMML-1774
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1774
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Algorithms, Compiler, ParFor
>    Affects Versions: SystemML 1.0
>            Reporter: Fei Hu
>              Labels: deeplearning
>         Attachments: Explain_For_HYBRID_SPARK_Mode_With_ErrorInfo.txt, Explain_For_Spark_Mode.txt,
MNIST_Distrib_Sgd.scala, mnist_lenet_distrib_sgd.dml
>
>
> When running the [distributed MNIST LeNet example | https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml],
each mini-batch could ideally run in parallel without interaction. We tried to change {{parfor
(j in 1:parallel_batches)}} at line 137 of {{nn/examples/mnist_lenet_distrib_sgd.dml}} to
{{parfor (j in 1:parallel_batches, mode=REMOTE_SPARK, opt=CONSTRAINED)}} in order to use
{{REMOTE_SPARK}} mode, but got the error {{org.apache.sysml.runtime.DMLRuntimeException: Not
supported: Instructions of type other than CP instructions}} when using mode {{SPARK}}, and a
{{java.lang.NullPointerException}} when using mode {{HYBRID_SPARK}}. More log information can
be found in the following comments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
