systemml-issues mailing list archives

From "Matthias Boehm (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SYSTEMML-1774) Improve Parfor parallelism for deep learning
Date Tue, 18 Jul 2017 18:44:00 GMT

    [ https://issues.apache.org/jira/browse/SYSTEMML-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091976#comment-16091976
] 

Matthias Boehm edited comment on SYSTEMML-1774 at 7/18/17 6:43 PM:
-------------------------------------------------------------------

ad 2) Forced spark execution mode together with parfor REMOTE_SPARK is invalid because it
would require running all operations as distributed spark operations while also running the
surrounding parfor as a distributed spark operation. This is invalid because nested spark/mapreduce
operations (i.e., RDD operations that call another RDD operation) are not supported, since they
could lead to deadlocks. By specifying spark (and thus forcing local parfor) you effectively run
multiple concurrent distributed operations on the cluster, which leads to full cluster utilization
on small data.
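
The rule above can be sketched as a small check. This is a toy model in Python, not SystemML
code: the function name and the instruction-type labels ("CP", "SPARK") are assumptions made
for illustration. They are loosely based on the reported error, which says that instruction
types other than CP are not supported.

```python
# Toy model of the constraint described above; NOT SystemML's implementation.
# The function name and the string labels are hypothetical.

def is_valid_parfor_plan(parfor_mode, body_instruction_types):
    """Return True if the parfor body is compatible with the parfor mode.

    A REMOTE_SPARK parfor already runs as a distributed Spark operation,
    so its body may contain only single-node (CP) instructions; a Spark
    instruction inside it would be a nested RDD operation.
    A LOCAL parfor runs on the driver, so its body may launch Spark
    operations concurrently.
    """
    if parfor_mode == "REMOTE_SPARK":
        return all(t == "CP" for t in body_instruction_types)
    return True


# Forced spark execution mode: every body operation is a Spark operation.
forced_spark_body = ["SPARK", "SPARK", "SPARK"]

print(is_valid_parfor_plan("REMOTE_SPARK", forced_spark_body))  # nested RDD operations
print(is_valid_parfor_plan("LOCAL", forced_spark_body))         # concurrent Spark jobs from the driver
print(is_valid_parfor_plan("REMOTE_SPARK", ["CP", "CP"]))       # remote parfor with a CP-only body
```

In this model, only the first combination is rejected, which corresponds to forced spark
execution mode combined with parfor REMOTE_SPARK.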


was (Author: mboehm7):
ad 2) Forced spark execution mode together with parfor REMOTE_SPARK are invalid because it
would require to run all operations as distributed spark operations as well as the surrounding
parfor as a distributed spark operation. It is invalid because there are no nested spark/mapreduce
operations (i.e., RDD operations that calls another RDD operation) since this could lead to
deadlocks. By specifying spark and (and thus forcing local parfor) you effectively run multiple
concurrent distributed operations on the cluster which leads to full cluster utilization on
small data.

> Improve Parfor parallelism for deep learning
> --------------------------------------------
>
>                 Key: SYSTEMML-1774
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1774
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Algorithms, Compiler, ParFor
>    Affects Versions: SystemML 1.0
>            Reporter: Fei Hu
>              Labels: deeplearning
>         Attachments: Explain_For_HYBRID_SPARK_Mode_With_ErrorInfo.txt, Explain_For_Spark_Mode.txt,
MNIST_Distrib_Sgd.scala, mnist_lenet_distrib_sgd.dml
>
>
> When running the [distributed MNIST LeNet example | https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml],
each mini-batch could ideally run in parallel without interaction. We tried to force {{parfor
(j in 1:parallel_batches)}} at line 137 of {{nn/examples/mnist_lenet_distrib_sgd.dml}} to
use {{REMOTE_SPARK}} mode by changing it to {{parfor (j in 1:parallel_batches, mode=REMOTE_SPARK,
opt=CONSTRAINED)}}, but got the error {{org.apache.sysml.runtime.DMLRuntimeException: Not supported:
Instructions of type other than CP instructions}} when using the mode {{SPARK}}, and the error
{{java.lang.NullPointerException}} when using the mode {{HYBRID_SPARK}}. More log information can
be found in the following comments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
