systemml-issues mailing list archives

From "Fei Hu (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SYSTEMML-1760) Improve engine robustness of distributed SGD training
Date Tue, 25 Jul 2017 06:35:00 GMT

    [ https://issues.apache.org/jira/browse/SYSTEMML-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099275#comment-16099275 ]

Fei Hu edited comment on SYSTEMML-1760 at 7/25/17 6:34 AM:
-----------------------------------------------------------

cc [~mboehm7], [~dusenberrymw], [~niketanpansare]    The following runtime statistics are
from one run of the distributed MNIST example on the Spark cluster. Note that the parfor
result merge took much more time than any other part of the run. Is that reasonable?


{quote}Total elapsed time:        1624.575 sec.
Total compilation time:        0.000 sec.
{color:#d04437}Total execution time:        1624.575 sec.{color}
Number of compiled Spark inst:    188.
Number of executed Spark inst:    6.
Cache hits (Mem, WB, FS, HDFS):    481/0/0/288.
Cache writes (WB, FS, HDFS):    214/0/108.
Cache times (ACQr/m, RLS, EXP):    1043.481/0.002/0.017/18.529 sec.
HOP DAGs recompiled (PRED, SB):    0/13.
HOP DAGs recompile time:    0.049 sec.
Functions recompiled:        1.
Functions recompile time:    0.157 sec.
Spark ctx create time (lazy):    0.006 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col):    0.000/0.000/0.000 secs.
ParFor loops optimized:        6.
ParFor optimize time:        0.151 sec.
ParFor initialize time:        0.000 sec.
{color:#d04437}ParFor result merge time:    1077.574 sec.{color}
ParFor total update in-place:    0/0/0
Total JIT compile time:        60.426 sec.
Total JVM GC count:        138.
{color:#d04437}Total JVM GC time:        220.124 sec.{color}{quote}
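
For scale, the 1077.574 sec of ParFor result merge is roughly two thirds of the 1624.575 sec
total execution time. As a reminder of what that step covers, below is a hedged DML sketch
(placeholder sizes and a stand-in least-squares gradient, not the actual
mnist_lenet_distrib_sgd.dml code): each parfor worker writes one column of a shared result
matrix, and those per-worker columns have to be merged back together once the loop completes,
which is the parfor result merge measured above.

{code}
# Hedged sketch only -- not the actual mnist_lenet_distrib_sgd.dml code.
# Each parfor worker computes a gradient on its own mini-batch and writes one
# column of the shared result matrix dW_agg; combining those per-worker columns
# is the parfor result merge step reported in the statistics.
N = 1024                                  # placeholder number of examples
D = 512                                   # placeholder number of features
P = 16                                    # placeholder degree of parallelism
batch_size = 64                           # N / P
X = rand(rows=N, cols=D)                  # placeholder features
y = rand(rows=N, cols=1)                  # placeholder labels
W = rand(rows=D, cols=1)                  # current model parameters
dW_agg = matrix(0, rows=D, cols=P)        # shared result variable
parfor (j in 1:P) {
  beg = (j - 1) * batch_size + 1
  end = j * batch_size
  X_j = X[beg:end,]
  y_j = y[beg:end,]
  dW_agg[,j] = t(X_j) %*% (X_j %*% W - y_j)   # least-squares gradient as a stand-in
}
dW = rowMeans(dW_agg)                     # average the merged per-worker gradients
print(sum(dW))
{code}

In the MNIST LeNet example the merged results are presumably the aggregated gradients for each
parameter matrix, so the merge volume grows with model size times the degree of parallelism.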



> Improve engine robustness of distributed SGD training
> -----------------------------------------------------
>
>                 Key: SYSTEMML-1760
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1760
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Algorithms, Compiler, ParFor
>            Reporter: Mike Dusenberry
>            Assignee: Fei Hu
>
> Currently, we have a mathematical framework in place for training with distributed SGD
> in a [distributed MNIST LeNet example | https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml].
> This task aims to push this to scale to determine (1) the current behavior of the engine
> (i.e., does the optimizer actually run this in a distributed fashion?), and (2) ways to
> improve the robustness and performance for this scenario. The distributed SGD framework from
> this example has already been ported into Caffe2DML, and thus improvements made for this task
> will directly benefit our efforts towards distributed training of Caffe models (and Keras
> models in the future).
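
For reference, the synchronous data-parallel mini-batch SGD update that such a framework
parallelizes can be written as follows (generic symbols, not taken from the script; the
example may aggregate gradients differently than a plain average):

{noformat}
W_{t+1} = W_t - \frac{\eta}{P} \sum_{j=1}^{P} \nabla_W L(W_t; B_j)
{noformat}

where W_t are the model parameters at step t, \eta is the learning rate, and B_1, ..., B_P are
the P mini-batches processed in parallel (e.g., one per parfor worker) before their gradients
are merged and averaged.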



