From: GitBox
To: dev@singa.apache.org
Subject: [GitHub] [singa] chrishkchris edited a comment on issue #591: Dev branch cpu training problem (with conv and pool)
Date: Tue, 11 Feb 2020 07:40:33 -0000

chrishkchris edited a comment on issue #591: Dev branch cpu training problem (with conv and pool)
URL: https://github.com/apache/singa/issues/591#issuecomment-584485839

I have tried both GCC OpenMP and Intel TBB (Threading Building Blocks) when compiling DNNL from source (a sketch of the build configuration is included after the results below). Training is extremely slow in both cases (the normal time per epoch should be around a minute), but the training loss results are correct.

1. GCC OpenMP
```
root@3edb30e30b08:~/dcsysh/singa/examples/autograd# python3 mnist_cnn.py
Starting Epoch 0:
Training loss = 564.547180, training accuracy = 0.800644
Evaluation accuracy = 0.931591, Elapsed Time = 1348.363244s
Starting Epoch 1:
Training loss = 229.964905, training accuracy = 0.922892
Evaluation accuracy = 0.959535, Elapsed Time = 1344.685418s
Starting Epoch 2:
Training loss = 163.646332, training accuracy = 0.944837
Evaluation accuracy = 0.973758, Elapsed Time = 1346.530425s
Starting Epoch 3:
Training loss = 135.699615, training accuracy = 0.954526
Evaluation accuracy = 0.970152, Elapsed Time = 1346.398193s
Starting Epoch 4:
Training loss = 115.944962, training accuracy = 0.962096
Evaluation accuracy = 0.968750, Elapsed Time = 1349.933991s
Starting Epoch 5:
Training loss = 102.581963, training accuracy = 0.965548
Evaluation accuracy = 0.976963, Elapsed Time = 1343.627475s
Starting Epoch 6:
Training loss = 91.995560, training accuracy = 0.969701
Evaluation accuracy = 0.980168, Elapsed Time = 1345.709435s
Starting Epoch 7:
Training loss = 85.334785, training accuracy = 0.971051
Evaluation accuracy = 0.977664, Elapsed Time = 1342.384448s
Starting Epoch 8:
Training loss = 81.609375, training accuracy = 0.972018
Evaluation accuracy = 0.981571, Elapsed Time = 1345.214866s
Starting Epoch 9:
Training loss = 76.690147, training accuracy = 0.974203
Evaluation accuracy = 0.977364, Elapsed Time = 1354.111479s
```
2. Intel TBB (Threading Building Blocks); the run was interrupted (^C) during epoch 8
```
root@3edb30e30b08:~/dcsysh/singa/examples/autograd# python3 mnist_cnn.py
Starting Epoch 0:
Training loss = 566.089539, training accuracy = 0.800527
Evaluation accuracy = 0.938201, Elapsed Time = 1571.624848s
Starting Epoch 1:
Training loss = 229.882874, training accuracy = 0.923192
Evaluation accuracy = 0.957833, Elapsed Time = 1569.219801s
Starting Epoch 2:
Training loss = 164.734573, training accuracy = 0.945137
Evaluation accuracy = 0.955929, Elapsed Time = 1567.359108s
Starting Epoch 3:
Training loss = 132.956802, training accuracy = 0.955310
Evaluation accuracy = 0.968550, Elapsed Time = 1572.159664s
Starting Epoch 4:
Training loss = 117.263237, training accuracy = 0.960646
Evaluation accuracy = 0.969151, Elapsed Time = 1570.090345s
Starting Epoch 5:
Training loss = 105.917274, training accuracy = 0.965115
Evaluation accuracy = 0.978466, Elapsed Time = 1569.966338s
Starting Epoch 6:
Training loss = 93.056519, training accuracy = 0.968700
Evaluation accuracy = 0.976362, Elapsed Time = 1571.289907s
Starting Epoch 7:
Training loss = 85.500954, training accuracy = 0.971101
Evaluation accuracy = 0.981771, Elapsed Time = 1572.169596s
Starting Epoch 8:
^CTraceback (most recent call last):
  File "/root/dcsysh/singa/build/python/singa/singa_wrap.py", line 302, in <lambda>
    __setattr__ = lambda self, name, value: _swig_setattr(self, Tensor, name, value)
KeyboardInterrupt

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "mnist_cnn.py", line 293, in <module>
    train_mnist_cnn(sgd=sgd, max_epoch=max_epoch, batch_size=batch_size)
  File "mnist_cnn.py", line 249, in train_mnist_cnn
    sgd.backward_and_update(loss)
  File "/root/dcsysh/singa/build/python/singa/opt.py", line 179, in backward_and_update
    for p, g in autograd.backward(loss):
  File "/root/dcsysh/singa/build/python/singa/autograd.py", line 166, in backward
    dxs = op._do_backward(*dys)
  File "/root/dcsysh/singa/build/python/singa/autograd.py", line 313, in _do_backward
    dxs = self.backward(*dys)
  File "/root/dcsysh/singa/build/python/singa/autograd.py", line 1256, in backward
    self.handle)
SystemError: returned a result with an error set
```
3. The old mkldnn in the master branch (results copied from PR https://github.com/apache/singa/pull/579)
```
ubuntu@ip-172-31-24-48:~/singa/examples/autograd$ python3 mnist_cnn.py
Starting Epoch 0:
Training loss = 585.431152, training accuracy = 0.791739
Evaluation accuracy = 0.930088, Elapsed Time = 55.447133s
Starting Epoch 1:
Training loss = 232.831589, training accuracy = 0.922158
Evaluation accuracy = 0.967949, Elapsed Time = 55.337850s
Starting Epoch 2:
Training loss = 166.067307, training accuracy = 0.945788
Evaluation accuracy = 0.968550, Elapsed Time = 55.367847s
Starting Epoch 3:
Training loss = 136.865341, training accuracy = 0.954092
Evaluation accuracy = 0.973357, Elapsed Time = 55.358584s
Starting Epoch 4:
Training loss = 118.813286, training accuracy = 0.960195
Evaluation accuracy = 0.979567, Elapsed Time = 55.270505s
Starting Epoch 5:
Training loss = 106.185112, training accuracy = 0.964481
Evaluation accuracy = 0.975962, Elapsed Time = 55.281344s
Starting Epoch 6:
Training loss = 94.444023, training accuracy = 0.968016
Evaluation accuracy = 0.980970, Elapsed Time = 55.081426s
Starting Epoch 7:
Training loss = 88.213493, training accuracy = 0.970418
Evaluation accuracy = 0.982873, Elapsed Time = 54.912524s
Starting Epoch 8:
Training loss = 81.126442, training accuracy = 0.972886
Evaluation accuracy = 0.981470, Elapsed Time = 54.907317s
Starting Epoch 9:
Training loss = 77.790993, training accuracy = 0.974236
Evaluation accuracy = 0.974159, Elapsed Time = 54.915229s
```

So DNNL may be around 24 times slower than the old mkldnn (roughly 1348 s vs. 55 s per epoch with OpenMP, and about 1570 s per epoch with TBB)?
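For reference, here is a minimal sketch of how the two DNNL builds above would typically be configured. This is an assumption about the build steps, not copied from this thread: the repository URL, directory names, and the TBB install path are placeholders. The one load-bearing detail is DNNL's `DNNL_CPU_RUNTIME` CMake option, which selects the CPU threading runtime at build time.

```
# Sketch of building DNNL with each CPU threading runtime.
# Paths and the TBB location below are placeholders (assumptions).
git clone https://github.com/intel/mkl-dnn.git dnnl
cd dnnl && mkdir -p build && cd build

# 1. GCC OpenMP runtime (the default):
cmake -DDNNL_CPU_RUNTIME=OMP ..

# 2. Intel TBB runtime (TBBROOT must point at a TBB installation):
# cmake -DDNNL_CPU_RUNTIME=TBB -DTBBROOT=/opt/intel/tbb ..

make -j"$(nproc)" && make install

# When timing, it may also be worth pinning the OpenMP thread count,
# e.g.  OMP_NUM_THREADS="$(nproc)" python3 mnist_cnn.py
# to rule out thread oversubscription (a guess, not verified in this thread).
```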