From: GitBox
To: dev@systemml.apache.org
Subject: [GitHub] [systemml] niketanpansare commented on issue #856: [SYSTEMML-540] Improve the performance of GPU lstm backward operator by passing the state
Message-ID: <155257837115.13738.7043560427419784665.gitbox@gitbox.apache.org>
Date: Thu, 14 Mar 2019 15:46:11 -0000

niketanpansare commented on issue #856: [SYSTEMML-540] Improve the performance of GPU lstm backward operator by passing the state
URL: https://github.com/apache/systemml/pull/856#issuecomment-472928065

Setup:
```
N = 64
tmp = 0
for(i in 1:100) {
  [output, c, cache] = lstm::forward(x, w, b, return_seq, out0, c0)
  [dX, dW, db, dout0, dc0] = lstm::backward(output, c, x, w, b, return_seq, out0, c0, cache)
  c0 = c
  tmp = tmp - sum(dX) + sum(dW) - sum(dout0) + sum(db) - sum(dc0)
}
print(tmp)
```
The plots below measure end-to-end runtime (which includes CUDA initialization and the execution times of other instructions).
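For intuition about why passing the state helps, here is a minimal Python sketch of the pattern the DML script exercises. This is not SystemML's LSTM implementation; it is a toy one-layer tanh cell with hypothetical names, showing how a forward pass can stash its intermediates in a `cache` so the backward pass reuses them instead of recomputing the forward activations:

```python
import math

def forward(x, w):
    # Toy "cell": elementwise w*x followed by tanh. The intermediates are
    # stashed in a cache so backward need not recompute the forward pass.
    # (Hypothetical sketch, not SystemML's lstm::forward.)
    z = [wi * xi for wi, xi in zip(w, x)]
    out = [math.tanh(zi) for zi in z]
    cache = (x, out)  # state handed to backward
    return out, cache

def backward(dout, w, cache):
    # d/dz tanh(z) = 1 - tanh(z)^2, so the cached tanh outputs are reused
    # directly instead of rerunning forward.
    x, out = cache
    dz = [d * (1 - o * o) for d, o in zip(dout, out)]
    dx = [wi * dzi for wi, dzi in zip(w, dz)]
    dw = [xi * dzi for xi, dzi in zip(x, dz)]
    return dx, dw

out, cache = forward([0.5, -1.0], [2.0, 0.3])
dx, dw = backward([1.0, 1.0], [2.0, 0.3], cache)
```

The real operator does the analogous thing at the cuDNN level: the forward call's workspace/reserve state is carried into the backward call, which is what the PR's `cache` argument enables.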
For a ballpark comparison, here are the stats for T=100, D=1000, M=1000, return_sequence=TRUE.

- PR:
```
SystemML Statistics:
Total elapsed time:             17.769 sec.
Total compilation time:         0.575 sec.
Total execution time:           17.194 sec.
Number of compiled Spark inst:  0.
Number of executed Spark inst:  0.
CUDA/CuLibraries init time:     5.079/1.659 sec.
Number of executed GPU inst:    400.
GPU mem alloc time  (alloc(success/fail) / dealloc / set0):       0.038(0.038/0.000) / 0.015 / 0.026 sec.
GPU mem alloc count (alloc(success/fail/reuse) / dealloc / set0): 316(316/0/1589) / 300 / 1905.
GPU mem size (alloc (peak) / evict): 99.341 GB(778.687 MB) / 0 bytes.
GPU mem tx time  (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 0.027(0.000/0.000) / 7.055(0.000/0.000) / 0.000(0.000/0.000) sec.
GPU mem tx count (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 5(0/0) / 900(0/0) / 0(0/0).
GPU conversion time  (sparseConv / sp2dense / dense2sp): 0.000 / 0.000 / 0.000 sec.
GPU conversion count (sparseConv / sp2dense / dense2sp): 0 / 0 / 0.
Cache hits (Mem, WB, FS, HDFS): 306/0/0/0.
Cache writes (WB, FS, HDFS):    2/0/0.
Cache times (ACQr/m, RLS, EXP): 7.059/0.018/0.019/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/500.
HOP DAGs recompile time:        7.477 sec.
Spark ctx create time (lazy):   0.000 sec.
Spark trans counts (par,bc,col): 0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time:         6.133 sec.
Total JVM GC count:             3.
Total JVM GC time:              0.109 sec.
Heavy hitter instructions:
  #  Instruction         Time(s)  Count
  1  backward              6.648    100
  2  gpu_lstm_backward     6.605    100
  3  forward               2.481    100
  4  gpu_lstm              2.416    100
  5  rand                  0.526      5
  6  rlit                  0.127    300
  7  gpu_uak+              0.076    200
  8  rmvar                 0.016    902
  9  -                     0.011    300
 10  createvar             0.007    805
```
- Apache master:
```
SystemML Statistics:
Total elapsed time:             20.896 sec.
Total compilation time:         0.619 sec.
Total execution time:           20.277 sec.
Number of compiled Spark inst:  0.
Number of executed Spark inst:  0.
CUDA/CuLibraries init time:     5.077/1.663 sec.
Number of executed GPU inst:    400.
GPU mem alloc time  (alloc(success/fail) / dealloc / set0):       0.032(0.032/0.000) / 0.013 / 0.029 sec.
GPU mem alloc count (alloc(success/fail/reuse) / dealloc / set0): 318(318/0/2087) / 300 / 2405.
GPU mem size (alloc (peak) / evict): 133.918 GB(778.687 MB) / 0 bytes.
GPU mem tx time  (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 0.032(0.000/0.000) / 9.737(0.000/0.000) / 0.000(0.000/0.000) sec.
GPU mem tx count (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 5(0/0) / 900(0/0) / 0(0/0).
GPU conversion time  (sparseConv / sp2dense / dense2sp): 0.000 / 0.000 / 0.000 sec.
GPU conversion count (sparseConv / sp2dense / dense2sp): 0 / 0 / 0.
Cache hits (Mem, WB, FS, HDFS): 306/0/0/0.
Cache writes (WB, FS, HDFS):    2/0/0.
Cache times (ACQr/m, RLS, EXP): 9.742/0.019/0.020/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/500.
HOP DAGs recompile time:        10.223 sec.
Spark ctx create time (lazy):   0.000 sec.
Spark trans counts (par,bc,col): 0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time:         5.769 sec.
Total JVM GC count:             3.
Total JVM GC time:              0.119 sec.
Heavy hitter instructions:
  #  Instruction         Time(s)  Count
  1  backward              6.963    100
  2  gpu_lstm_backward     6.865    100
  3  forward               2.529    100
  4  gpu_lstm              2.465    100
  5  rand                  0.559      5
  6  rlit                  0.129    300
  7  gpu_uak+              0.076    200
  8  rmvar                 0.015    902
  9  -                     0.011    300
 10  createvar             0.004    705
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: users@infra.apache.org

With regards,
Apache Git Services
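As a quick sanity check, the headline deltas between the two statistics dumps above work out as follows (all numbers copied verbatim from the dumps; this is just arithmetic, not additional measurement):

```python
# Totals taken from the two SystemML statistics dumps above.
pr_elapsed, master_elapsed = 17.769, 20.896      # total elapsed time (s)
pr_d2h,     master_d2h     = 7.055,  9.737       # fromDev (device-to-host) tx time (s)
pr_alloc,   master_alloc   = 99.341, 133.918     # cumulative GPU mem alloc (GB)

speedup     = master_elapsed / pr_elapsed        # end-to-end speedup, ~1.18x
d2h_saved   = master_d2h - pr_d2h                # transfer time saved, 2.682 s
alloc_saved = master_alloc - pr_alloc            # allocation traffic saved, ~34.6 GB

print(f"{speedup:.2f}x faster, {d2h_saved:.3f} s less d2h, {alloc_saved:.1f} GB less alloc")
```

Most of the end-to-end gap is accounted for by the reduced device-to-host transfer time, consistent with the PR passing the forward state to backward instead of copying it out.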