systemml-dev mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [systemml] niketanpansare commented on issue #856: [SYSTEMML-540] Improve the performance of GPU lstm backward operator by passing the state
Date Thu, 14 Mar 2019 15:46:11 GMT
niketanpansare commented on issue #856: [SYSTEMML-540] Improve the performance of GPU lstm backward operator by passing the state
URL: https://github.com/apache/systemml/pull/856#issuecomment-472928065
 
 
   Setup:
   ```
   N = 64
   tmp = 0
   for(i in 1:100) {
     [output, c, cache] = lstm::forward(x, w, b, return_seq, out0, c0)
     [dX, dW, db, dout0, dc0] = lstm::backward(output, c, x, w, b, return_seq, out0, c0, cache)
     c0 = c
     tmp = tmp - sum(dX) + sum(dW) - sum(dout0) + sum(db) - sum(dc0)
   }
   print(tmp)
   ```
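The alternating signs in the `tmp` accumulation force every gradient output to be consumed, so the backward call cannot be skipped as dead code. A minimal Python sketch of the same forced-evaluation pattern (the `forward`/`backward` stand-ins are hypothetical placeholders, not the SystemML API):

```python
# Hypothetical analog of the benchmark loop above: alternately adding and
# subtracting the sum of each gradient keeps all five outputs "live" so the
# runtime must compute them, while the final scalar stays cheap to print.
def run_loop(forward, backward, iters=100):
    tmp = 0.0
    for _ in range(iters):
        # backward consumes everything forward produces, as in the DML setup
        dX, dW, db, dout0, dc0 = backward(*forward())
        tmp = tmp - sum(dX) + sum(dW) - sum(dout0) + sum(db) - sum(dc0)
    return tmp
```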
   
   The plots below measure end-to-end runtime (which includes CUDA init and the execution times of other instructions). For a ballpark comparison, here are the stats for T=100, D=1000, M=1000, return_sequence=TRUE:
   
   - PR:
   ```
   SystemML Statistics:
   Total elapsed time:             17.769 sec.
   Total compilation time:         0.575 sec.
   Total execution time:           17.194 sec.
   Number of compiled Spark inst:  0.
   Number of executed Spark inst:  0.
   CUDA/CuLibraries init time:     5.079/1.659 sec.
   Number of executed GPU inst:    400.
   GPU mem alloc time  (alloc(success/fail) / dealloc / set0):     0.038(0.038/0.000) / 0.015 / 0.026 sec.
   GPU mem alloc count (alloc(success/fail/reuse) / dealloc / set0):       316(316/0/1589) / 300 / 1905.
   GPU mem size (alloc (peak) / evict):    99.341 GB(778.687 MB) / 0 bytes.
   GPU mem tx time  (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 0.027(0.000/0.000) / 7.055(0.000/0.000) / 0.000(0.000/0.000) sec.
   GPU mem tx count (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 5(0/0) / 900(0/0) / 0(0/0).
   GPU conversion time  (sparseConv / sp2dense / dense2sp):        0.000 / 0.000 / 0.000 sec.
   GPU conversion count (sparseConv / sp2dense / dense2sp):        0 / 0 / 0.
   Cache hits (Mem, WB, FS, HDFS): 306/0/0/0.
   Cache writes (WB, FS, HDFS):    2/0/0.
   Cache times (ACQr/m, RLS, EXP): 7.059/0.018/0.019/0.000 sec.
   HOP DAGs recompiled (PRED, SB): 0/500.
   HOP DAGs recompile time:        7.477 sec.
   Spark ctx create time (lazy):   0.000 sec.
   Spark trans counts (par,bc,col):0/0/0.
   Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
   Total JIT compile time:         6.133 sec.
   Total JVM GC count:             3.
   Total JVM GC time:              0.109 sec.
   Heavy hitter instructions:
     #  Instruction        Time(s)  Count
     1  backward             6.648    100
     2  gpu_lstm_backward    6.605    100
     3  forward              2.481    100
     4  gpu_lstm             2.416    100
     5  rand                 0.526      5
     6  rlit                 0.127    300
     7  gpu_uak+             0.076    200
     8  rmvar                0.016    902
     9  -                    0.011    300
    10  createvar            0.007    805
   ```
   - Apache master:
   ```
   SystemML Statistics:
   Total elapsed time:             20.896 sec.
   Total compilation time:         0.619 sec.
   Total execution time:           20.277 sec.
   Number of compiled Spark inst:  0.
   Number of executed Spark inst:  0.
   CUDA/CuLibraries init time:     5.077/1.663 sec.
   Number of executed GPU inst:    400.
   GPU mem alloc time  (alloc(success/fail) / dealloc / set0):     0.032(0.032/0.000) / 0.013 / 0.029 sec.
   GPU mem alloc count (alloc(success/fail/reuse) / dealloc / set0):       318(318/0/2087) / 300 / 2405.
   GPU mem size (alloc (peak) / evict):    133.918 GB(778.687 MB) / 0 bytes.
   GPU mem tx time  (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 0.032(0.000/0.000) / 9.737(0.000/0.000) / 0.000(0.000/0.000) sec.
   GPU mem tx count (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 5(0/0) / 900(0/0) / 0(0/0).
   GPU conversion time  (sparseConv / sp2dense / dense2sp):        0.000 / 0.000 / 0.000 sec.
   GPU conversion count (sparseConv / sp2dense / dense2sp):        0 / 0 / 0.
   Cache hits (Mem, WB, FS, HDFS): 306/0/0/0.
   Cache writes (WB, FS, HDFS):    2/0/0.
   Cache times (ACQr/m, RLS, EXP): 9.742/0.019/0.020/0.000 sec.
   HOP DAGs recompiled (PRED, SB): 0/500.
   HOP DAGs recompile time:        10.223 sec.
   Spark ctx create time (lazy):   0.000 sec.
   Spark trans counts (par,bc,col):0/0/0.
   Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
   Total JIT compile time:         5.769 sec.
   Total JVM GC count:             3.
   Total JVM GC time:              0.119 sec.
   Heavy hitter instructions:
     #  Instruction        Time(s)  Count
     1  backward             6.963    100
     2  gpu_lstm_backward    6.865    100
     3  forward              2.529    100
     4  gpu_lstm             2.465    100
     5  rand                 0.559      5
     6  rlit                 0.129    300
     7  gpu_uak+             0.076    200
     8  rmvar                0.015    902
     9  -                    0.011    300
    10  createvar            0.004    705
   ```
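Taken together, the statistics above put the PR at roughly a 1.18x end-to-end speedup over master for this configuration, with the largest deltas in device-to-host transfer time and cumulative GPU allocation. A quick sanity check of those deltas (all figures copied from the statistics above):

```python
# All figures copied from the SystemML statistics above (seconds / GB).
master = {"elapsed": 20.896, "from_dev": 9.737, "lstm_backward": 6.865, "alloc_gb": 133.918}
pr     = {"elapsed": 17.769, "from_dev": 7.055, "lstm_backward": 6.605, "alloc_gb":  99.341}

speedup = master["elapsed"] / pr["elapsed"]
print(f"end-to-end speedup:       {speedup:.2f}x")
print(f"fromDev transfer saved:   {master['from_dev'] - pr['from_dev']:.3f} sec")
print(f"gpu_lstm_backward saved:  {master['lstm_backward'] - pr['lstm_backward']:.3f} sec")
print(f"cumulative GPU alloc cut: {master['alloc_gb'] - pr['alloc_gb']:.3f} GB")
```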
