singa-dev mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [singa] chrishkchris commented on a change in pull request #728: Fix create_cuda_gpu
Date Wed, 10 Jun 2020 09:43:45 GMT

chrishkchris commented on a change in pull request #728:
URL: https://github.com/apache/singa/pull/728#discussion_r437997370



##########
File path: python/singa/device.py
##########
@@ -113,10 +113,8 @@ def create_cuda_gpu(set_default=False):
         a swig converted CudaGPU device.
     '''
     assert singa.USE_CUDA, 'SINGA has not been compiled with CUDA enabled.'
-    devices = singa.Platform.CreateCudaGPUs(1)
-    if set_default:
-        set_default_device(devices[0])
-    return devices[0]
+    device = create_cuda_gpu_on(0, set_default)

Review comment:
   I added the default_gpu_device.
   
   I also removed the unused call `set_default_device(device)`, which set the default device
for graph operations. It prevented the `to_device` operation from being buffered, which would
repeat `to_device` every iteration. However, since the new layer API does the initialization
separately, `set_default_device(device)` is no longer necessary.
   
   I have tested ResNet on CIFAR-10 and the training works:
   
   ```
   root@56142bc34887:~/dcsysh/singa/examples/cnn# mpiexec -np 8 python3 train_mpi.py resnet cifar10 -b 32 -l 0.04
   Starting Epoch 0:
   Training loss = 3952.119385, training accuracy = 0.216567
   Evaluation accuracy = 0.342648, Elapsed Time = 52.988312s
   Starting Epoch 1:
   Training loss = 2519.932373, training accuracy = 0.399439
   Evaluation accuracy = 0.467849, Elapsed Time = 52.414376s
   Starting Epoch 2:
   Training loss = 2165.224854, training accuracy = 0.497937
   Evaluation accuracy = 0.560998, Elapsed Time = 52.504168s
   Starting Epoch 3:
   Training loss = 1884.613525, training accuracy = 0.565605
   Evaluation accuracy = 0.596755, Elapsed Time = 52.721652s
   Starting Epoch 4:
   Training loss = 1682.462158, training accuracy = 0.617188
   Evaluation accuracy = 0.643429, Elapsed Time = 52.857880s
   Starting Epoch 5:
   Training loss = 1514.762329, training accuracy = 0.654888
   Evaluation accuracy = 0.689002, Elapsed Time = 52.902534s
   Starting Epoch 6:
   Training loss = 1372.399536, training accuracy = 0.689283
   Evaluation accuracy = 0.708434, Elapsed Time = 53.047120s
   Starting Epoch 7:
   Training loss = 1220.632446, training accuracy = 0.726362
   Evaluation accuracy = 0.743389, Elapsed Time = 53.035040s
   Starting Epoch 8:
   Training loss = 1110.090942, training accuracy = 0.751182
   Evaluation accuracy = 0.761919, Elapsed Time = 53.105462s
   Starting Epoch 9:
   Training loss = 1006.458618, training accuracy = 0.775160
   Evaluation accuracy = 0.772436, Elapsed Time = 53.146286s
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


