singa-dev mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [singa] chrishkchris edited a comment on issue #646: SINGA-487 Test case for distributed module
Date Thu, 02 Apr 2020 06:27:33 GMT
chrishkchris edited a comment on issue #646: SINGA-487 Test case for distributed module
URL: https://github.com/apache/singa/pull/646#issuecomment-607628829
 
 
   > > > @XJDKC I also added `self.fused_all_reduce([p.data], send=False)` to the function `backward_and_partial_update`
   > > > to adopt the last change to the `fused_all_reduce` function
   > > 
   > > 
   > > You can pass the whole plist to the `fused_all_reduce` function; there is no need to pass tensors one by one.
   > 
   > done, it becomes
   > 
   > ```
   >                         self.fused_all_reduce(plist, send=False)
   >                         self.fused_all_reduce(plist)
   > ```
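
The suggestion above, passing the whole `plist` rather than one tensor per call, follows the usual idea behind a fused all-reduce: pack many small tensors into one buffer so that a single collective call replaces many. The sketch below simulates that idea in plain Python with lists standing in for tensors and workers. It is not SINGA's implementation; `fuse`, `unfuse`, and `fused_all_reduce_sim` are hypothetical names made up for illustration, and the `send=False` flag from the snippet above is not modeled.

```python
# Illustrative simulation only; none of these names come from SINGA.

def fuse(tensors):
    """Concatenate flat tensors into one buffer, recording each tensor's length."""
    buffer, sizes = [], []
    for t in tensors:
        buffer.extend(t)
        sizes.append(len(t))
    return buffer, sizes


def unfuse(buffer, sizes):
    """Split a fused buffer back into tensors of the recorded lengths."""
    tensors, offset = [], 0
    for n in sizes:
        tensors.append(buffer[offset:offset + n])
        offset += n
    return tensors


def fused_all_reduce_sim(per_worker_tensors):
    """Sum corresponding tensors across simulated workers with one reduction.

    per_worker_tensors[w] is worker w's list of flat gradient tensors.
    Returns the reduced list of tensors, as every worker would see it.
    """
    fused = [fuse(tensors) for tensors in per_worker_tensors]
    sizes = fused[0][1]
    # Every worker must contribute the same tensor layout.
    assert all(s == sizes for _, s in fused)
    # One element-wise sum over the fused buffers stands in for a single all-reduce.
    reduced = [sum(vals) for vals in zip(*(buf for buf, _ in fused))]
    return unfuse(reduced, sizes)


worker0 = [[1.0, 2.0], [3.0]]
worker1 = [[10.0, 20.0], [30.0]]
result = fused_all_reduce_sim([worker0, worker1])
print(result)  # [[11.0, 22.0], [33.0]]
```

Passing tensors one by one would correspond to calling the reduction once per tensor; handing over the full list lets the packing step cover all of them in one call.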
   
   
   
   
   The asynchronous training is also okay now:
   ```
   root@71ac539cda77:~/dcsysh/singa/examples/autograd# python3 cifar10_multiprocess.py
   Starting Epoch 0:
   Training loss = 2910.528564, training accuracy = 0.335027
   Evaluation accuracy = 0.438902, Elapsed Time = 148.023807s
   Starting Epoch 1:
   Training loss = 2098.108887, training accuracy = 0.512984
   Evaluation accuracy = 0.558393, Elapsed Time = 147.847368s
   Starting Epoch 2:
   Training loss = 1696.253052, training accuracy = 0.612596
   Evaluation accuracy = 0.634615, Elapsed Time = 145.889916s
   Starting Epoch 3:
   Training loss = 1424.486328, training accuracy = 0.678877
   Evaluation accuracy = 0.725461, Elapsed Time = 144.518893s
   Starting Epoch 4:
   Training loss = 1202.804688, training accuracy = 0.732714
   Evaluation accuracy = 0.731070, Elapsed Time = 145.818738s
   ...
   Training loss = 950.140442, training accuracy = 0.789133
   Evaluation accuracy = 0.761418, Elapsed Time = 144.987327s
   Starting Epoch 7:
   Training loss = 866.323792, training accuracy = 0.809579
   Evaluation accuracy = 0.746895, Elapsed Time = 146.025936s
   Starting Epoch 8:
   Training loss = 801.880615, training accuracy = 0.821523
   Evaluation accuracy = 0.830128, Elapsed Time = 143.801048s
   Starting Epoch 9:
   Training loss = 725.051636, training accuracy = 0.841169
   Evaluation accuracy = 0.836939, Elapsed Time = 145.036991s
   Starting Epoch 10:
   Training loss = 673.175293, training accuracy = 0.852393
   Evaluation accuracy = 0.844651, Elapsed Time = 143.906451s
   Starting Epoch 11:
   Training loss = 618.263550, training accuracy = 0.863936
   Evaluation accuracy = 0.824319, Elapsed Time = 143.375741s
   Starting Epoch 12:
   Training loss = 591.045410, training accuracy = 0.869418
   Evaluation accuracy = 0.848157, Elapsed Time = 143.443777s
   Starting Epoch 13:
   Training loss = 562.720825, training accuracy = 0.876521
   Evaluation accuracy = 0.840545, Elapsed Time = 143.486150s
   Starting Epoch 14:
   Training loss = 521.839844, training accuracy = 0.885563
   Evaluation accuracy = 0.849760, Elapsed Time = 145.186745s
   Starting Epoch 15:
   Training loss = 485.454468, training accuracy = 0.891225
   Evaluation accuracy = 0.863281, Elapsed Time = 146.288495s
   Starting Epoch 16:
   Training loss = 455.249939, training accuracy = 0.899728
   Evaluation accuracy = 0.873898, Elapsed Time = 145.327655s
   Starting Epoch 17:
   Training loss = 422.714111, training accuracy = 0.905450
   Evaluation accuracy = 0.865084, Elapsed Time = 144.338753s
   Starting Epoch 18:
   Training loss = 403.263367, training accuracy = 0.909271
   Evaluation accuracy = 0.856571, Elapsed Time = 144.544193s
   Starting Epoch 19:
   Training loss = 406.371643, training accuracy = 0.910131
   Evaluation accuracy = 0.825821, Elapsed Time = 145.991908s
   Starting Epoch 20:
   Training loss = 385.730377, training accuracy = 0.913312
   Evaluation accuracy = 0.887720, Elapsed Time = 146.428741s
   Starting Epoch 21:
   Training loss = 350.121643, training accuracy = 0.921975
   Evaluation accuracy = 0.876803, Elapsed Time = 146.140011s
   Starting Epoch 22:
   Training loss = 336.024078, training accuracy = 0.925196
   Evaluation accuracy = 0.883213, Elapsed Time = 143.576008s
   Starting Epoch 23:
   Training loss = 310.097626, training accuracy = 0.930998
   Evaluation accuracy = 0.885116, Elapsed Time = 146.495115s
   Starting Epoch 24:
   Training loss = 291.123596, training accuracy = 0.934959
   Evaluation accuracy = 0.894932, Elapsed Time = 145.639998s
   Starting Epoch 25:
   Training loss = 275.165466, training accuracy = 0.937720
   Evaluation accuracy = 0.859575, Elapsed Time = 147.771487s
   Starting Epoch 26:
   Training loss = 263.919128, training accuracy = 0.940941
   Evaluation accuracy = 0.898037, Elapsed Time = 144.860399s
   Starting Epoch 27:
   Training loss = 257.803558, training accuracy = 0.942182
   Evaluation accuracy = 0.896134, Elapsed Time = 144.756769s
   Starting Epoch 28:
   Training loss = 230.348862, training accuracy = 0.948744
   Evaluation accuracy = 0.898838, Elapsed Time = 146.492527s
   Starting Epoch 29:
   Training loss = 215.334015, training accuracy = 0.952305
   Evaluation accuracy = 0.896835, Elapsed Time = 146.046224s
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
