systemml-issues mailing list archives

From "Matthias Boehm (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SYSTEMML-1774) Improve Parfor parallelism for deep learning
Date Wed, 19 Jul 2017 03:18:00 GMT

    [ https://issues.apache.org/jira/browse/SYSTEMML-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092508#comment-16092508
] 

Matthias Boehm commented on SYSTEMML-1774:
------------------------------------------

ok, after some initial debugging with {{hybrid_spark + parfor}} and driver Xmx 4g, it seems
that the parfor optimizer decided on a degree of parallelism of 1 (single-threaded, which caused
the slowdown) due to the following (unknown) memory estimates:
{code}
17/07/18 19:50:15 WARN opt.CostEstimator: Memory estimate larger than budget but CP exec type
(op=BIAS_ADD, name=26_out, memest=7.635730732E9).
17/07/18 19:50:15 WARN opt.CostEstimator: Memory estimate larger than budget but CP exec type
(op=MAX_POOLING, name=28_out, memest=7.63573052E9).
17/07/18 19:50:15 WARN opt.CostEstimator: Memory estimate larger than budget but CP exec type
(op=BIAS_ADD, name=29_out, memest=7.635730988E9).
17/07/18 19:50:15 WARN opt.CostEstimator: Memory estimate larger than budget but CP exec type
(op=MAX_POOLING, name=31_out, memest=7.63573052E9).
17/07/18 19:50:15 WARN opt.CostEstimator: Memory estimate larger than budget but CP exec type
(op=MAX_POOLING_BACKWARD, name=42_dX, memest=1.1453595736E10).
17/07/18 19:50:15 WARN opt.CostEstimator: Memory estimate larger than budget but CP exec type
(op=DIRECT_CONV2D_BACKWARD_DATA, name=45_dX, memest=7.636140164E9).
17/07/18 19:50:15 WARN opt.CostEstimator: Memory estimate larger than budget but CP exec type
(op=DIRECT_CONV2D_BACKWARD_FILTER, name=45_dW, memest=7.636140164E9).
17/07/18 19:50:15 WARN opt.CostEstimator: Memory estimate larger than budget but CP exec type
(op=MAX_POOLING_BACKWARD, name=46_dX, memest=1.1453595736E10).
17/07/18 19:50:15 WARN opt.CostEstimator: Memory estimate larger than budget but CP exec type
(op=DIRECT_CONV2D_BACKWARD_FILTER, name=48_dW, memest=3.819088816E9).
{code}
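To make the connection between these estimates and the chosen degree of parallelism concrete, here is a minimal, purely illustrative sketch (hypothetical function, not SystemML's actual cost model): each parfor worker must fit the largest operation's memory estimate, so estimates above the whole budget force serial execution.

```python
# Hypothetical sketch (not SystemML code) of how a parfor-style optimizer
# might derive the degree of parallelism from a memory budget: each worker
# must fit the largest operation's estimate, so an estimate above the whole
# budget forces k = 1 (serial execution).

def degree_of_parallelism(budget_bytes, max_op_estimate_bytes, max_k):
    """Number of workers that fit in the budget, clamped to [1, max_k]."""
    if max_op_estimate_bytes <= 0:
        return max_k
    k = int(budget_bytes // max_op_estimate_bytes)
    return max(1, min(k, max_k))

# With a 4 GB driver (Xmx 4g; assume an illustrative ~70% usable fraction)
# and the ~7.6e9-byte estimates from the log above, no worker fits:
budget = int(4 * 1024**3 * 0.7)  # assumed usable fraction, illustrative
print(degree_of_parallelism(budget, 7.635730732e9, max_k=16))  # -> 1
```

Under these assumptions, any single estimate above the usable budget yields k = 1 regardless of how many cores are available, which matches the single-threaded plan below.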

For more evidence, here is a fragment of the parfor plan with {{hybrid_spark + parfor}}:

{code}
----------------------------
 EXPLAIN OPT TREE (type=ABSTRACT_PLAN, size=122)
----------------------------
--PARFOR (lines 137-213), exec=CP, k=1, dp=NONE, tp=FACTORING, rm=REMOTE_SPARK
----GENERIC (lines 139-162), exec=CP, k=1
------rix, exec=CP, k=1
------b(+), exec=CP, k=1
------b(%%), exec=CP, k=1
------b(*), exec=CP, k=1
------b(-), exec=CP, k=1
------u(nrow), exec=CP, k=1
------b(min), exec=CP, k=1
------b(-), exec=CP, k=1
------b(+), exec=CP, k=1
------rix, exec=CP, k=1
------BIAS_ADD, exec=CP, k=16
------DIRECT_CONV2D, exec=CP, k=16
{code}

In contrast, the parfor plan with {{spark + parfor}} looks as follows:

{code}
----------------------------
 EXPLAIN OPT TREE (type=ABSTRACT_PLAN, size=122)
----------------------------
--PARFOR (lines 137-213), exec=CP, k=4, dp=NONE, tp=NAIVE, rm=REMOTE_SPARK
----GENERIC (lines 139-162), exec=CP, k=1
------rix, exec=SPARK, k=1
------b(+), exec=SPARK, k=1
------b(%%), exec=SPARK, k=1
------b(*), exec=SPARK, k=1
------b(-), exec=SPARK, k=1
------u(nrow), exec=CP, k=1
------b(min), exec=SPARK, k=1
------b(-), exec=SPARK, k=1
------b(+), exec=SPARK, k=1
------rix, exec=SPARK, k=1
------BIAS_ADD, exec=CP, k=4
------DIRECT_CONV2D, exec=CP, k=4
{code}

Note that the degree of parallelism of 4 is actually incorrect given the unknown memory estimates
of the convolution ops above; this requires some deeper analysis.

So the bottom line is: the real issue originates from size propagation, and there are
two action items here: (1) address the size propagation issue, and (2) fix the potentially
incorrect handling of memory estimates for convolution ops under forced spark execution mode.
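For context on action item (2), here is a purely illustrative sketch (hypothetical names and constants, not SystemML's actual cost model) of how a forced spark execution mode could skew the budget check: if SPARK-assigned operators contribute only a small CP-side footprint to the estimate, the check can pass and the optimizer can pick k > 1 even though the true convolution sizes are unknown.

```python
# Hypothetical sketch of the suspected bug (not SystemML code): under a
# forced spark execution mode, SPARK-assigned operators may contribute only
# a small CP-side footprint to the plan's memory estimate, so the budget
# check passes and k > 1 is chosen despite unknown convolution sizes.

SPARK_CP_FOOTPRINT = 64 * 1024**2  # assumed CP-side cost of a SPARK op

def plan_parallelism(ops, budget_bytes, max_k):
    """ops: list of (exec_type, mem_estimate_bytes); returns the chosen k."""
    worst = max(est if ex == "CP" else SPARK_CP_FOOTPRINT for ex, est in ops)
    return max(1, min(int(budget_bytes // worst), max_k))

budget = int(4 * 1024**3 * 0.7)  # illustrative usable driver budget

# hybrid_spark: the big convolution ops stay CP -> nothing fits -> k = 1
print(plan_parallelism([("CP", 7.64e9), ("CP", 1.15e10)], budget, 4))  # -> 1

# forced spark: the same ops are SPARK-assigned, so only the small CP-side
# footprint is counted -> the check passes -> k = 4 (incorrectly)
print(plan_parallelism([("SPARK", 7.64e9), ("SPARK", 1.15e10)], budget, 4))  # -> 4
```

This would explain why the {{spark + parfor}} plan above reports k=4 while the {{hybrid_spark + parfor}} plan reports k=1 for the same script.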

> Improve Parfor parallelism for deep learning
> --------------------------------------------
>
>                 Key: SYSTEMML-1774
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1774
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Algorithms, Compiler, ParFor
>    Affects Versions: SystemML 1.0
>            Reporter: Fei Hu
>              Labels: deeplearning
>         Attachments: Explain_For_HYBRID_SPARK_Mode_With_ErrorInfo.txt, Explain_For_Spark_Mode.txt,
> MNIST_Distrib_Sgd.scala, mnist_lenet_distrib_sgd.dml
>
>
> When running the [distributed MNIST LeNet example | https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml],
> each mini-batch could ideally run in parallel without interaction. We tried to force the {{parfor
> (j in 1:parallel_batches)}} at line 137 of {{nn/examples/mnist_lenet_distrib_sgd.dml}} to
> {{parfor (j in 1:parallel_batches, mode=REMOTE_SPARK, opt=CONSTRAINED)}} in order to use {{REMOTE_SPARK}}
> mode, but got the error {{org.apache.sysml.runtime.DMLRuntimeException: Not supported:
> Instructions of type other than CP instructions}} when using mode {{SPARK}}, and a
> {{java.lang.NullPointerException}} when using mode {{HYBRID_SPARK}}. More log information can
> be found in the following comments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
