singa-dev mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
Date Thu, 01 Aug 2019 13:55:50 GMT
chrishkchris commented on a change in pull request #468: Distributted module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r309709702
 
 

 ##########
 File path: src/api/config.i
 ##########
 @@ -0,0 +1,33 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+
+
+// Pass in cmake configurations to swig
+#define USE_CUDA 1
+#define USE_CUDNN 1
+#define USE_OPENCL 0
+#define USE_PYTHON 1
+#define USE_MKLDNN 1
+#define USE_JAVA 0
+#define CUDNN_VERSION 7401
+
+// SINGA version
+#define SINGA_MAJOR_VERSION 1
 
 Review comment:
   Updated on 1 August 2019:
   
   Concerning the above error, I found that there is a difference between the implementations
of `class _BatchNorm2d(Operation):` in the master branch and the dist_new branch.
   
   In autograd.py, both the master branch and the dist_new branch have modified (or debugged) the
conv2d and batchnorm operators, but they modified them differently. The conv2d operators in both
branches can train and reduce the loss of the simple MNIST CNN, so there is no big problem
there. Batch normalization, however, is a much more complex case, because it includes
non-training variables: the running means and running variances.
   
   In the master branch, the running means and running variances (non-training variables)
are input arguments of the forward function: `def forward(self, x, scale, bias, running_mean, running_var):`
   https://github.com/apache/incubator-singa/blob/master/python/singa/autograd.py#L1099
   
   When I run the code using the master branch dockerfile, the error is as follows:
   ```
   root@26c9db193eb0:~/incubator-singa/examples/autograd# python3 resnet.py
   Start intialization............
     0%|          | 0/200 [00:00<?, ?it/s]
   Traceback (most recent call last):
     File "resnet.py", line 249, in <module>
       for p, g in autograd.backward(loss):
     File "/root/incubator-singa/build/python/singa/autograd.py", line 135, in backward
       % (len(op.src), len(dxs))
   AssertionError: the number of src ops (=5) and dx (=3) not match
   ```
   I think the error occurs because running_mean and running_var are input arguments of the
forward function but are not training variables. Backward returns gradients only for x, scale
and bias, so three src ops are expected, but five are found.
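
   The mismatch can be reproduced with a toy model. This is only a sketch under stated assumptions: `ToyOp`, `ToyBatchNormMaster` and `check_backward` are hypothetical names made up for illustration, not SINGA classes, and the only behavior taken from the traceback above is the length check between `op.src` and the gradients returned by backward.

```python
class ToyOp:
    """Hypothetical stand-in for an autograd operation (not SINGA's class)."""

    def __init__(self):
        self.src = []

    def __call__(self, *inputs):
        # Assumption: every forward argument is recorded as a source.
        self.src = list(inputs)
        return self.forward(*inputs)


class ToyBatchNormMaster(ToyOp):
    """Mirrors the master-branch signature: five forward inputs."""

    def forward(self, x, scale, bias, running_mean, running_var):
        return x  # the actual normalization is omitted in this sketch

    def backward(self, dy):
        # Gradients exist only for x, scale and bias (placeholder values).
        return dy, dy, dy


def check_backward(op, dy):
    """Apply the same kind of length check as the assertion in the traceback."""
    dxs = op.backward(dy)
    assert len(op.src) == len(dxs), (
        "the number of src ops (=%d) and dx (=%d) not match"
        % (len(op.src), len(dxs))
    )
    return dxs


op = ToyBatchNormMaster()
op(1.0, 1.0, 0.0, 0.0, 1.0)  # x, scale, bias, running_mean, running_var
try:
    check_backward(op, 1.0)
except AssertionError as err:
    print(err)
```

   In this toy, five sources are recorded but only three gradients come back, so the check fails in the same way as in the log.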
   
   Meanwhile, the dist_new branch has modified the batchnorm operator (commit 2b3a857 by user
ubuntu on Apr 14) by moving the input arguments running_mean and running_var from the forward
function into the initialization function:
   `def __init__(self, handle, running_mean, running_var, name=None):`
   `def forward(self, x, scale, bias):`
   https://github.com/xuewanqi/incubator-singa/blob/dist_new/python/singa/autograd.py#L1096
   This version runs successfully, but I am not sure yet whether it can train and reduce the loss.
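
   A minimal sketch of this second layout, again with hypothetical names (`ToyBatchNormDistNew`, the `momentum` and `eps` parameters, and the running-statistics update rule are assumptions, not taken from the dist_new code). It only shows that keeping the running statistics as state set in `__init__` leaves three forward inputs to match three gradients; it says nothing about whether the real operator trains correctly.

```python
class ToyBatchNormDistNew:
    """Hypothetical toy: running statistics are held as state, not as inputs."""

    def __init__(self, running_mean, running_var, momentum=0.9, eps=1e-5):
        self.running_mean = running_mean
        self.running_var = running_var
        self.momentum = momentum  # assumed update convention
        self.eps = eps
        self.src = []

    def __call__(self, x, scale, bias):
        # Only the three differentiable inputs are recorded as sources.
        self.src = [x, scale, bias]
        return self.forward(x, scale, bias)

    def forward(self, x, scale, bias):
        mean = sum(x) / len(x)
        var = sum((v - mean) ** 2 for v in x) / len(x)
        # Non-training variables are updated in place, outside the graph.
        m = self.momentum
        self.running_mean = m * self.running_mean + (1 - m) * mean
        self.running_var = m * self.running_var + (1 - m) * var
        return [scale * (v - mean) / (var + self.eps) ** 0.5 + bias for v in x]

    def backward(self, dy):
        # Placeholder gradients for x, scale, bias; only the count matters here.
        return dy, 0.0, 0.0


op = ToyBatchNormDistNew(running_mean=0.0, running_var=1.0)
y = op([1.0, 3.0], 1.0, 0.0)
dxs = op.backward([1.0, 1.0])
print(len(op.src), len(dxs))
```

   Here the number of recorded sources and the number of returned gradients agree, so a length check like the one in the traceback would pass.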
   
   Next, I will try training the resnet with a real dataset to see whether it can reduce the loss.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
