systemml-issues mailing list archives

From "LI Guobao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SYSTEMML-2398) Paramserv ASP aggregation overhead on update per epoch
Date Fri, 15 Jun 2018 11:56:00 GMT

    [ https://issues.apache.org/jira/browse/SYSTEMML-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513720#comment-16513720
] 

LI Guobao commented on SYSTEMML-2398:
-------------------------------------

[~mboehm7], the timer currently sits inside the aggregation service function, which is also
invoked when a local worker updates its own model. The reported aggregation time therefore
includes local model updates, so I need to add an indicator that separates aggregation from
local model updating.
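
To illustrate the idea, here is a minimal sketch of a shared update routine that books its elapsed time under one of two counters depending on a flag. Everything in it (the class name AggTimingSketch, the updateModel method, the isAggregation flag and the counters) is hypothetical and chosen for illustration only; it is not the actual paramserv code or the SystemML Statistics API, and the element-wise addition merely stands in for the real update function.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: NOT the SystemML paramserv implementation.
public class AggTimingSketch {
    // Separate counters so local updates no longer inflate the aggregation time.
    // LongAdder keeps the accumulation safe when many workers report concurrently.
    static final LongAdder aggNanos = new LongAdder();
    static final LongAdder aggCalls = new LongAdder();
    static final LongAdder localUpdateNanos = new LongAdder();
    static final LongAdder localUpdateCalls = new LongAdder();

    /**
     * Stand-in for the shared update routine (assumed here to be a plain
     * element-wise addition of the gradients onto the model).
     *
     * @param isAggregation true when called by the server-side aggregation,
     *                      false when a worker updates its local model
     */
    static void updateModel(double[] model, double[] gradients, boolean isAggregation) {
        long start = System.nanoTime();
        for (int i = 0; i < model.length; i++) {
            model[i] += gradients[i];
        }
        long elapsed = System.nanoTime() - start;

        // Book the elapsed time under the counter that matches the caller.
        if (isAggregation) {
            aggNanos.add(elapsed);
            aggCalls.increment();
        } else {
            localUpdateNanos.add(elapsed);
            localUpdateCalls.increment();
        }
    }
}
```

With such a split, the "total aggregation time" line in the statistics would be fed only from the aggregation counter, and the local update time could be reported separately or folded into the worker-side timings.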

> Paramserv ASP aggregation overhead on update per epoch
> ------------------------------------------------------
>
>                 Key: SYSTEMML-2398
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2398
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: Matthias Boehm
>            Priority: Major
>
> Here are the statistics of mnist60K, 2 epochs, 80 workers in ASP
> {code}
> SystemML Statistics:
> Total elapsed time:		449.548 sec.
> Total compilation time:		1.995 sec.
> Total execution time:		447.553 sec.
> Number of compiled MR Jobs:	0.
> Number of executed MR Jobs:	0.
> Cache hits (Mem, WB, FS, HDFS):	970241/0/0/2.
> Cache writes (WB, FS, HDFS):	55191/0/0.
> Cache times (ACQr/m, RLS, EXP):	1.048/0.120/1.087/0.000 sec.
> HOP DAGs recompiled (PRED, SB):	0/13582.
> HOP DAGs recompile time:	24.473 sec.
> Functions recompiled:		1.
> Functions recompile time:	0.013 sec.
> Paramserv func number of workers:	79.
> Paramserv func total gradients compute time:	1714.962 secs.
> Paramserv func total aggregation time:	428.499 secs.
> Paramserv func model broadcasting time:	2.080 secs.
> Paramserv func total batch slicing time:	0.0190000000 secs.
> Total JIT compile time:		37.461 sec.
> Total JVM GC count:		66.
> Total JVM GC time:		7.098 sec.
> Heavy hitter instructions:
>   #  Instruction             Time(s)  Count
>   1  conv2d_bias_add         719.111  13768
>   2  paramserv               437.051      1
>   3  relu_backward           210.414  20370
>   4  ba+*                    180.001  40928
>   5  conv2d_backward_filter  175.104  13580
>   6  +*                      156.714  81480
>   7  conv2d_backward_data    140.779   6790
>   8  *                       123.502  95173
>   9  -*                      104.058  54320
>  10  -                        94.502  74985
> {code}
> As we see, the aggregation is a major bottleneck. This is unexpected given the coarse-grained
> update per epoch. [~Guobao], could you please have a look and profile where this is coming
> from?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
