Date: Thu, 3 Aug 2017 17:00:09 +0000 (UTC)
From: "Mike Dusenberry (JIRA)"
To: issues@systemml.apache.org
Reply-To: dev@systemml.apache.org
Subject: [jira] [Commented] (SYSTEMML-1760) Improve engine robustness of distributed SGD training

    [ https://issues.apache.org/jira/browse/SYSTEMML-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113085#comment-16113085 ]

Mike Dusenberry commented on SYSTEMML-1760:
-------------------------------------------

[~Tenma] Awesome! That's a great speedup. Now that we've identified that the parfor optimizer is not choosing the optimal plan for this type of scenario, we can use these experiments to make improvements so that a naive usage of parfor yields a plan with the same performance (or better!).
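For context, a "naive usage of parfor" here means simply splitting the gradient work across independent parfor iterations and letting the optimizer pick the execution plan. Below is a minimal, self-contained DML sketch of that pattern; it is an illustrative assumption using a toy least-squares model, not the actual mnist_lenet_distrib_sgd.dml code, and the parallel_batches, batch_size, and learning rate values are made up for the example.

    # Minimal DML sketch (illustrative assumption, not the actual
    # mnist_lenet_distrib_sgd.dml script): one naive data-parallel SGD step
    # on a toy least-squares model, with the gradient work split across
    # parfor iterations so the parfor optimizer decides how to parallelize.
    parallel_batches = 4
    batch_size = 64
    N = parallel_batches * batch_size
    D = 10
    lr = 0.01
    X = rand(rows=N, cols=D)
    y = rand(rows=N, cols=1)
    W = matrix(0, rows=D, cols=1)

    grads = matrix(0, rows=parallel_batches, cols=D)  # one gradient row per worker

    parfor (j in 1:parallel_batches) {
      # Each iteration reads a disjoint slice of the data and writes a
      # disjoint row of the result matrix, so the iterations are independent.
      beg = (j - 1) * batch_size + 1
      end = j * batch_size
      X_j = X[beg:end, ]
      y_j = y[beg:end, ]
      grad_j = t(X_j) %*% (X_j %*% W - y_j) / batch_size  # squared-loss gradient
      grads[j, ] = t(grad_j)
    }

    # Average the per-worker gradients and apply one SGD update.
    W = W - lr * t(colSums(grads) / parallel_batches)

Whether the parfor body runs as local threads or as distributed Spark jobs is up to the parfor optimizer, which is exactly the plan choice being examined in these experiments.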
> Improve engine robustness of distributed SGD training
> -----------------------------------------------------
>
>                 Key: SYSTEMML-1760
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1760
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Algorithms, Compiler, ParFor
>            Reporter: Mike Dusenberry
>            Assignee: Fei Hu
>         Attachments: Runtime_Table.png
>
>
> Currently, we have a mathematical framework in place for training with distributed SGD in a [distributed MNIST LeNet example | https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml]. This task aims to push this at scale to determine (1) the current behavior of the engine (i.e., does the optimizer actually run this in a distributed fashion?), and (2) ways to improve the robustness and performance for this scenario. The distributed SGD framework from this example has already been ported into Caffe2DML, and thus improvements made for this task will directly benefit our efforts towards distributed training of Caffe models (and Keras models in the future).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)