systemml-issues mailing list archives

From "Matthias Boehm (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SYSTEMML-2398) Paramserv ASP aggregation overhead in on update per epoch
Date Fri, 15 Jun 2018 06:07:00 GMT
Matthias Boehm created SYSTEMML-2398:
----------------------------------------

             Summary: Paramserv ASP aggregation overhead in on update per epoch
                 Key: SYSTEMML-2398
                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2398
             Project: SystemML
          Issue Type: Bug
            Reporter: Matthias Boehm


Here are the statistics for mnist60K with 2 epochs and 80 workers in ASP:
{code}
SystemML Statistics:
Total elapsed time:		449.548 sec.
Total compilation time:		1.995 sec.
Total execution time:		447.553 sec.
Number of compiled MR Jobs:	0.
Number of executed MR Jobs:	0.
Cache hits (Mem, WB, FS, HDFS):	970241/0/0/2.
Cache writes (WB, FS, HDFS):	55191/0/0.
Cache times (ACQr/m, RLS, EXP):	1.048/0.120/1.087/0.000 sec.
HOP DAGs recompiled (PRED, SB):	0/13582.
HOP DAGs recompile time:	24.473 sec.
Functions recompiled:		1.
Functions recompile time:	0.013 sec.
Paramserv func number of workers:	79.
Paramserv func total gradients compute time:	1714.962 secs.
Paramserv func total aggregation time:	428.499 secs.
Paramserv func model broadcasting time:	2.080 secs.
Paramserv func total batch slicing time:	0.0190000000 secs.
Total JIT compile time:		37.461 sec.
Total JVM GC count:		66.
Total JVM GC time:		7.098 sec.
Heavy hitter instructions:
  #  Instruction             Time(s)  Count
  1  conv2d_bias_add         719.111  13768
  2  paramserv               437.051      1
  3  relu_backward           210.414  20370
  4  ba+*                    180.001  40928
  5  conv2d_backward_filter  175.104  13580
  6  +*                      156.714  81480
  7  conv2d_backward_data    140.779   6790
  8  *                       123.502  95173
  9  -*                      104.058  54320
 10  -                        94.502  74985
{code}

As the statistics show, aggregation is a major bottleneck, with 428.499 secs of total
aggregation time. This is unexpected given the coarse-grained update frequency of once per
epoch. [~Guobao] could you please have a look and profile where this is coming from?
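
To put the numbers above in perspective, here is a rough back-of-the-envelope sketch in Python. The timings are copied from the statistics. The rest is assumption rather than anything the report states: that each worker pushes exactly one update per epoch, and that the reported aggregation time can be compared directly to the time of the paramserv instruction.

```python
# Figures copied from the statistics above.
aggregation_s = 428.499   # Paramserv func total aggregation time
gradient_s = 1714.962     # Paramserv func total gradients compute time
paramserv_s = 437.051     # heavy hitter time of the paramserv instruction
workers = 79              # Paramserv func number of workers
epochs = 2                # from the experiment description

# Assumption (not stated in the report): with per-epoch updates,
# each worker pushes exactly one update per epoch.
expected_updates = workers * epochs

# Average cost of a single aggregation under that assumption.
per_update_s = aggregation_s / expected_updates

# Average gradient compute time per worker (the total is summed over workers).
gradient_per_worker_s = gradient_s / workers

# Assumption: aggregation time and paramserv instruction time are comparable
# (e.g. both wall-clock); if so, this is the aggregation share.
agg_share = aggregation_s / paramserv_s

print(f"expected updates:        {expected_updates}")
print(f"seconds per aggregation: {per_update_s:.3f}")
print(f"gradient secs per worker: {gradient_per_worker_s:.3f}")
print(f"aggregation share:       {agg_share:.1%}")
```

If these assumptions hold, every single aggregation would take on the order of seconds, which is far more than adding up a handful of gradient matrices should cost. That would point either at many more aggregations than expected or at overhead inside each aggregation.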




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)