systemml-issues mailing list archives

From "Fei Hu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SYSTEMML-1774) Improve Parfor parallelism for deep learning
Date Tue, 18 Jul 2017 18:43:00 GMT

    [ https://issues.apache.org/jira/browse/SYSTEMML-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091975#comment-16091975 ]

Fei Hu commented on SYSTEMML-1774:
----------------------------------

Our experiment plan to test the performance of the distributed MNIST LeNet example, with results so far:
# HYBRID_SPARK + plain parfor: ~1.5 hours
# SPARK + plain parfor: ~30 mins
# HYBRID_SPARK + REMOTE_SPARK parfor: failed
# SPARK + REMOTE_SPARK parfor: failed

Once we have run times for all four scenarios, we should get some hints about how to improve
the performance of distributed SGD training.

Some new findings from running the experiments on a local machine with {{HYBRID_SPARK
+ REMOTE_SPARK parfor}}:
* When using the attached Scala file to run the example, we get the errors shown in
{{Explain_For_HYBRID_SPARK_Mode_With_ErrorInfo.txt}}.
* However, these errors on the local machine appear to be related to {{batchSize}}
and the memory size available to Spark:
** {{batchSize}}: after changing it from {{2}} to {{16}} ({{val batchSize = 16}}), the errors disappeared.
** Memory size for Spark:
*** With {{batchSize}} set to {{4}}, the same errors happened.
*** But after increasing the memory size for Spark to {{11.8GB}} via {{conf.set("spark.testing.memory",
memSize.toString)}} with {{batchSize = 4}}, the errors disappeared.
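The memory workaround above can be sketched as follows. This is a hypothetical fragment, assuming the attached {{MNIST_Distrib_Sgd.scala}} builds its own {{SparkConf}}; the exact computation of {{memSize}} is an assumption (only the property name and the ~11.8GB target come from the comment above).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: ~11.8 GB expressed in bytes, since Spark reads
// spark.testing.memory as a byte count (assumption for this example).
val memSize: Long = (11.8 * 1024 * 1024 * 1024).toLong

val conf = new SparkConf()
  .setAppName("MNIST_Distrib_Sgd") // app name assumed from the attached file
  .set("spark.testing.memory", memSize.toString)

val spark = SparkSession.builder().config(conf).getOrCreate()
```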








> Improve Parfor parallelism for deep learning
> --------------------------------------------
>
>                 Key: SYSTEMML-1774
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1774
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Algorithms, Compiler, ParFor
>    Affects Versions: SystemML 1.0
>            Reporter: Fei Hu
>              Labels: deeplearning
>         Attachments: Explain_For_HYBRID_SPARK_Mode_With_ErrorInfo.txt, Explain_For_Spark_Mode.txt,
> MNIST_Distrib_Sgd.scala, mnist_lenet_distrib_sgd.dml
>
>
> When running the [distributed MNIST LeNet example | https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml],
> each mini-batch could ideally run in parallel without interaction. To force {{REMOTE_SPARK}} mode, we changed {{parfor (j in 1:parallel_batches)}}
> at line 137 of {{nn/examples/mnist_lenet_distrib_sgd.dml}} to {{parfor (j in 1:parallel_batches, mode=REMOTE_SPARK, opt=CONSTRAINED)}},
> but got the error {{org.apache.sysml.runtime.DMLRuntimeException: Not supported: Instructions of type other than CP instructions}} in {{SPARK}} mode,
> and {{java.lang.NullPointerException}} in {{HYBRID_SPARK}} mode. More log information can be found in the following comments.
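The forced-parfor change described in the issue, sketched against {{mnist_lenet_distrib_sgd.dml}} (a minimal sketch: the loop body is elided, and the line number may differ across versions):

```dml
# Original (around line 137 of nn/examples/mnist_lenet_distrib_sgd.dml):
#   parfor (j in 1:parallel_batches) { ... }
# Forced to run each mini-batch as a remote Spark job with the
# constrained optimizer, so the mode is not overridden at compile time:
parfor (j in 1:parallel_batches, mode=REMOTE_SPARK, opt=CONSTRAINED) {
  # per-batch forward/backward pass and gradient computation unchanged
}
```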



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
