systemml-issues mailing list archives

From "Fei Hu (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SYSTEMML-1760) Improve engine robustness of distributed SGD training
Date Fri, 28 Jul 2017 19:05:00 GMT

    [ https://issues.apache.org/jira/browse/SYSTEMML-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105497#comment-16105497 ]

Fei Hu edited comment on SYSTEMML-1760 at 7/28/17 7:04 PM:
-----------------------------------------------------------

cc [~mboehm7], [~dusenberrymw], [~niketanpansare]: The following table shows the history of
the performance improvements. After fixing SYSTEMML-1762 and SYSTEMML-1774, the distributed
MNIST_LeNet model could be trained in parallel with the Hybrid_Spark and Remote_Spark parfor
modes. Changing the default parfor result merge strategy to REMOTE_SPARK reduced the runtime
considerably, which indicates that the result merge may be a performance bottleneck.

!Runtime_Table.png!
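
For reference, below is a minimal DML sketch of how the parfor execution mode and result merge
strategy can be pinned explicitly instead of being left to the optimizer. It assumes the parfor
optional parameters (mode, opt, resultmerge) from the DML language reference; the loop body is
only a placeholder, not the actual MNIST_LeNet gradient code.

{code}
# Hedged sketch: pin the parfor execution mode and result merge strategy.
# opt=CONSTRAINED keeps the user-specified parameters instead of letting
# the optimizer overwrite them. The loop body is a placeholder computation.
N = 1024                                  # number of independent iterations (example value)
R = matrix(0, rows=N, cols=10)            # one result row per iteration (placeholder shape)

parfor (i in 1:N, mode=REMOTE_SPARK, opt=CONSTRAINED, resultmerge=REMOTE_SPARK) {
  # stand-in for the per-batch forward/backward pass
  R[i,] = matrix(i, rows=1, cols=10)
}

print("checksum: " + sum(R))
{code}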



> Improve engine robustness of distributed SGD training
> -----------------------------------------------------
>
>                 Key: SYSTEMML-1760
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1760
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Algorithms, Compiler, ParFor
>            Reporter: Mike Dusenberry
>            Assignee: Fei Hu
>         Attachments: Runtime_Table.png
>
>
> Currently, we have a mathematical framework in place for training with distributed SGD
> in a [distributed MNIST LeNet example | https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml].
> This task aims to push this at scale to determine (1) the current behavior of the engine
> (i.e., does the optimizer actually run this in a distributed fashion?), and (2) ways to
> improve the robustness and performance of this scenario. The distributed SGD framework
> from this example has already been ported into Caffe2DML, so improvements made for this
> task will directly benefit our efforts towards distributed training of Caffe models (and
> Keras models in the future).
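
For context, the pattern in the linked example is roughly a data-parallel parfor over
mini-batches whose gradients are then aggregated into a single update. The sketch below is a
simplified, hypothetical illustration of that pattern with a least-squares gradient as a
stand-in for the LeNet forward/backward pass; it is not the code from
mnist_lenet_distrib_sgd.dml, and all variable names are placeholders.

{code}
# Hedged sketch of the data-parallel mini-batch SGD pattern: each parfor
# worker computes a gradient on its own mini-batch; the gradients are then
# averaged and applied as a single update (one outer iteration shown).
parallel_batches = 4
batch_size = 64
D = 784                                   # feature dimension (example value)
X = rand(rows=parallel_batches*batch_size, cols=D)
y = rand(rows=parallel_batches*batch_size, cols=1)
W = rand(rows=D, cols=1)
lr = 0.01

grads = matrix(0, rows=parallel_batches, cols=D)
parfor (j in 1:parallel_batches) {
  beg = (j-1) * batch_size + 1
  fin = j * batch_size
  X_b = X[beg:fin,]
  y_b = y[beg:fin,]
  # least-squares gradient as a placeholder for the LeNet backward pass
  g = t(X_b) %*% (X_b %*% W - y_b) / batch_size
  grads[j,] = t(g)
}
W = W - lr * t(colMeans(grads))           # average the gradients and update
print("updated ||W||^2: " + sum(W^2))
{code}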



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
