From: GitBox
To: dev@systemml.apache.org
Subject: [GitHub] [systemml] niketanpansare commented on issue #856: [SYSTEMML-540] Improve the performance of GPU lstm backward operator by passing the state
Message-ID: <155257837115.13738.7043560427419784665.gitbox@gitbox.apache.org>
Date: Thu, 14 Mar 2019 15:46:11 -0000

niketanpansare commented on issue #856: [SYSTEMML-540] Improve the performance of GPU lstm backward operator by passing the state
URL: https://github.com/apache/systemml/pull/856#issuecomment-472928065

Setup:
```
N = 64
tmp = 0
for(i in 1:100) {
  [output, c, cache] = lstm::forward(x, w, b, return_seq, out0, c0)
  [dX, dW, db, dout0, dc0] = lstm::backward(output, c, x, w, b, return_seq, out0, c0, cache)
  c0 = c
  tmp = tmp - sum(dX) + sum(dW) - sum(dout0) + sum(db) - sum(dc0)
}
print(tmp)
```
The plots below measure end-to-end runtime (which includes CUDA initialization and the execution times of other instructions).
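For intuition about why passing the state helps, here is a minimal Python sketch of the pattern the DML script exercises. This is not SystemML's LSTM implementation; it is a toy one-layer tanh cell with hypothetical names, showing how a forward pass can stash its intermediates in a `cache` so the backward pass reuses them instead of recomputing the forward activations:

```python
import math

def forward(x, w):
    # Toy "cell": elementwise w*x followed by tanh. The intermediates are
    # stashed in a cache so backward need not recompute the forward pass.
    # (Hypothetical sketch, not SystemML's lstm::forward.)
    z = [wi * xi for wi, xi in zip(w, x)]
    out = [math.tanh(zi) for zi in z]
    cache = (x, out)  # state handed to backward
    return out, cache

def backward(dout, w, cache):
    # d/dz tanh(z) = 1 - tanh(z)^2, so the cached tanh outputs are reused
    # directly instead of rerunning forward.
    x, out = cache
    dz = [d * (1 - o * o) for d, o in zip(dout, out)]
    dx = [wi * dzi for wi, dzi in zip(w, dz)]
    dw = [xi * dzi for xi, dzi in zip(x, dz)]
    return dx, dw

out, cache = forward([0.5, -1.0], [2.0, 0.3])
dx, dw = backward([1.0, 1.0], [2.0, 0.3], cache)
```

The real operator does the analogous thing at the cuDNN level: the forward call's workspace/reserve state is carried into the backward call, which is what the PR's `cache` argument enables.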
For a ballpark comparison, here are the stats for T=100, D=1000, M=1000, return_sequence=TRUE.

- PR:
```
SystemML Statistics:
Total elapsed time:             17.769 sec.
Total compilation time:         0.575 sec.
Total execution time:           17.194 sec.
Number of compiled Spark inst:  0.
Number of executed Spark inst:  0.
CUDA/CuLibraries init time:     5.079/1.659 sec.
Number of executed GPU inst:    400.
GPU mem alloc time  (alloc(success/fail) / dealloc / set0):       0.038(0.038/0.000) / 0.015 / 0.026 sec.
GPU mem alloc count (alloc(success/fail/reuse) / dealloc / set0): 316(316/0/1589) / 300 / 1905.
GPU mem size (alloc (peak) / evict): 99.341 GB(778.687 MB) / 0 bytes.
GPU mem tx time  (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 0.027(0.000/0.000) / 7.055(0.000/0.000) / 0.000(0.000/0.000) sec.
GPU mem tx count (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 5(0/0) / 900(0/0) / 0(0/0).
GPU conversion time  (sparseConv / sp2dense / dense2sp): 0.000 / 0.000 / 0.000 sec.
GPU conversion count (sparseConv / sp2dense / dense2sp): 0 / 0 / 0.
Cache hits (Mem, WB, FS, HDFS): 306/0/0/0.
Cache writes (WB, FS, HDFS):    2/0/0.
Cache times (ACQr/m, RLS, EXP): 7.059/0.018/0.019/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/500.
HOP DAGs recompile time:        7.477 sec.
Spark ctx create time (lazy):   0.000 sec.
Spark trans counts (par,bc,col): 0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time:         6.133 sec.
Total JVM GC count:             3.
Total JVM GC time:              0.109 sec.
Heavy hitter instructions:
  #  Instruction         Time(s)  Count
  1  backward              6.648    100
  2  gpu_lstm_backward     6.605    100
  3  forward               2.481    100
  4  gpu_lstm              2.416    100
  5  rand                  0.526      5
  6  rlit                  0.127    300
  7  gpu_uak+              0.076    200
  8  rmvar                 0.016    902
  9  -                     0.011    300
 10  createvar             0.007    805
```
- Apache master:
```
SystemML Statistics:
Total elapsed time:             20.896 sec.
Total compilation time:         0.619 sec.
Total execution time:           20.277 sec.
Number of compiled Spark inst:  0.
Number of executed Spark inst:  0.
CUDA/CuLibraries init time:     5.077/1.663 sec.
Number of executed GPU inst:    400.
GPU mem alloc time  (alloc(success/fail) / dealloc / set0):       0.032(0.032/0.000) / 0.013 / 0.029 sec.
GPU mem alloc count (alloc(success/fail/reuse) / dealloc / set0): 318(318/0/2087) / 300 / 2405.
GPU mem size (alloc (peak) / evict): 133.918 GB(778.687 MB) / 0 bytes.
GPU mem tx time  (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 0.032(0.000/0.000) / 9.737(0.000/0.000) / 0.000(0.000/0.000) sec.
GPU mem tx count (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 5(0/0) / 900(0/0) / 0(0/0).
GPU conversion time  (sparseConv / sp2dense / dense2sp): 0.000 / 0.000 / 0.000 sec.
GPU conversion count (sparseConv / sp2dense / dense2sp): 0 / 0 / 0.
Cache hits (Mem, WB, FS, HDFS): 306/0/0/0.
Cache writes (WB, FS, HDFS):    2/0/0.
Cache times (ACQr/m, RLS, EXP): 9.742/0.019/0.020/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/500.
HOP DAGs recompile time:        10.223 sec.
Spark ctx create time (lazy):   0.000 sec.
Spark trans counts (par,bc,col): 0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time:         5.769 sec.
Total JVM GC count:             3.
Total JVM GC time:              0.119 sec.
Heavy hitter instructions:
  #  Instruction         Time(s)  Count
  1  backward              6.963    100
  2  gpu_lstm_backward     6.865    100
  3  forward               2.529    100
  4  gpu_lstm              2.465    100
  5  rand                  0.559      5
  6  rlit                  0.129    300
  7  gpu_uak+              0.076    200
  8  rmvar                 0.015    902
  9  -                     0.011    300
 10  createvar             0.004    705
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: users@infra.apache.org

With regards,
Apache Git Services
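As a quick sanity check, the headline deltas between the two statistics dumps above work out as follows (all numbers copied verbatim from the dumps; this is just arithmetic, not additional measurement):

```python
# Totals taken from the two SystemML statistics dumps above.
pr_elapsed, master_elapsed = 17.769, 20.896      # total elapsed time (s)
pr_d2h,     master_d2h     = 7.055,  9.737       # fromDev (device-to-host) tx time (s)
pr_alloc,   master_alloc   = 99.341, 133.918     # cumulative GPU mem alloc (GB)

speedup     = master_elapsed / pr_elapsed        # end-to-end speedup, ~1.18x
d2h_saved   = master_d2h - pr_d2h                # transfer time saved, 2.682 s
alloc_saved = master_alloc - pr_alloc            # allocation traffic saved, ~34.6 GB

print(f"{speedup:.2f}x faster, {d2h_saved:.3f} s less d2h, {alloc_saved:.1f} GB less alloc")
```

Most of the end-to-end gap is accounted for by the reduced device-to-host transfer time, consistent with the PR passing the forward state to backward instead of copying it out.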