systemml-issues mailing list archives

From "Matthias Boehm (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SYSTEMML-2469) Large distributed paramserv overheads
Date Sat, 28 Jul 2018 00:36:00 GMT
Matthias Boehm created SYSTEMML-2469:
----------------------------------------

             Summary: Large distributed paramserv overheads
                 Key: SYSTEMML-2469
                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2469
             Project: SystemML
          Issue Type: Bug
            Reporter: Matthias Boehm


Initial runs with the distributed paramserv implementation on a small cluster revealed that
it works correctly but exhibits large overheads. Below are the stats for MNIST LeNet,
10 epochs, ASP, update per EPOCH, on a cluster of 1+6 nodes (24 cores per worker node).

{code}
Total elapsed time:             687.743 sec.
Total compilation time:         3.815 sec.
Total execution time:           683.928 sec.
Number of compiled Spark inst:  330.
Number of executed Spark inst:  0.
Cache hits (Mem, WB, FS, HDFS): 176210/0/0/2.
Cache writes (WB, FS, HDFS):    29856/5271/0.
Cache times (ACQr/m, RLS, EXP): 1.178/0.087/198.892/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/1629.
HOP DAGs recompile time:        4.878 sec.
Functions recompiled:           1.
Functions recompile time:       0.097 sec.
Spark ctx create time (lazy):   22.222 sec.
Spark trans counts (par,bc,col):2/1/0.
Spark trans times (par,bc,col): 0.390/0.242/0.000 secs.
Paramserv total num workers:    144.
Paramserv setup time:           68.259 secs.
Paramserv grad compute time:    6952.163 secs.
Paramserv model update time:    2453.448/422.955 secs.
Paramserv model broadcast time: 24.982 secs.
Paramserv batch slice time:     0.204 secs.
Paramserv RPC request time:     51611.210 secs.
ParFor loops optimized:         1.
ParFor optimize time:           0.462 sec.
ParFor initialize time:         0.049 sec.
ParFor result merge time:       0.028 sec.
ParFor total update in-place:   0/188/188
Total JIT compile time:         98.786 sec.
Total JVM GC count:             68.
Total JVM GC time:              25.858 sec.
Heavy hitter instructions:
  #  Instruction      Time(s)  Count
  1  paramserv        665.479      1
  2  +                182.410  18636
  3  conv2d_bias_add  150.938    376
  4  sqrt              69.768  11528
  5  /                 54.836  11732
  6  ba+*              45.901    376
  7  *                 38.046  11727
  8  -                 37.428  12096
  9  ^2                35.533   6344
 10  exp               21.022    188
{code}
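
To put the paramserv totals in perspective, the following minimal sketch divides them by the number of workers. It assumes that the reported RPC request and gradient compute times are summed over all 144 workers; the stats do not state this, but the totals exceed the elapsed time by a wide margin. Under that assumption, each worker spends on average roughly 358 seconds in RPC requests, about half of the total execution time, versus roughly 48 seconds computing gradients. The function and variable names are illustrative only.

```python
# Figures copied from the statistics above.
NUM_WORKERS = 144
EXECUTION_TIME_S = 683.928
RPC_REQUEST_TIME_S = 51611.210
GRAD_COMPUTE_TIME_S = 6952.163


def per_worker_average(total_seconds, num_workers=NUM_WORKERS):
    """Average seconds per worker, ASSUMING the total is summed over all workers."""
    return total_seconds / num_workers


rpc_avg = per_worker_average(RPC_REQUEST_TIME_S)
grad_avg = per_worker_average(GRAD_COMPUTE_TIME_S)
# Share of the wall-clock execution time an average worker spends in RPC requests.
rpc_share = rpc_avg / EXECUTION_TIME_S

print(f"avg RPC request time per worker:  {rpc_avg:.1f} s")
print(f"avg grad compute time per worker: {grad_avg:.1f} s")
print(f"RPC share of execution time:      {rpc_share:.0%}")
```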

There seem to be three distinct issues:
* Too large a number of tasks when assembling the distributed input data (proportional to
the number of rows, i.e., >50,000 tasks), which makes the distributed data partitioning very
slow (multiple minutes).
* Evictions from the buffer pool at the driver node (see cache writes). This is likely due
to disabling cleanup (and missing explicit cleanup) of all RPC objects.
* Large RPC overhead: This might be due to the evictions happening in the critical path while
all 144 workers wait with their RPC requests. In addition, we should double check that the
number of RPC handler threads is correct, investigate whether we can get the serialization
and communication out of the critical (i.e., synchronized) path of model updates, and address
unnecessary serialization/deserialization overheads.
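
The idea of moving serialization out of the synchronized path can be illustrated with a toy parameter server. This is only a sketch and not the SystemML implementation: `ToyParamServer`, `push`, and the use of `pickle` and a plain list as the model are all assumptions made for illustration. The point is that only the in-memory model update and a cheap copy happen while holding the lock, whereas deserializing the incoming gradients and serializing the reply happen outside of it, so concurrent requests do not queue up behind that work.

```python
import pickle
import threading


class ToyParamServer:
    """Hypothetical stand-in for a parameter server; not the SystemML class."""

    def __init__(self, model):
        self._model = list(model)
        self._lock = threading.Lock()

    def push(self, payload: bytes) -> bytes:
        # Deserialize the gradients BEFORE taking the lock.
        grads = pickle.loads(payload)
        with self._lock:
            # Critical section: only the update and a shallow copy of the model.
            for i, g in enumerate(grads):
                self._model[i] -= g
            snapshot = list(self._model)
        # Serialize the reply AFTER releasing the lock.
        return pickle.dumps(snapshot)


def worker(server, grads, steps):
    payload = pickle.dumps(grads)
    for _ in range(steps):
        server.push(payload)


def run(num_workers=8, steps=100):
    server = ToyParamServer([0.0, 0.0])
    threads = [
        threading.Thread(target=worker, args=(server, [1.0, 2.0], steps))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # A zero gradient leaves the model unchanged and returns its final state.
    return pickle.loads(server.push(pickle.dumps([0.0, 0.0])))


if __name__ == "__main__":
    print(run())
```

Whether this split pays off in practice depends on how expensive the copy inside the lock is relative to the serialization, which would need to be measured on the actual system.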

[~Guobao] I'll help reduce the serialization/deserialization overheads, but it would be
great if you could look into the other issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
