singa-dev mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [singa-doc] chrishkchris edited a comment on issue #14: rearrange contents in dist-train.md
Date Sat, 04 Apr 2020 08:19:45 GMT
chrishkchris edited a comment on issue #14: rearrange contents in dist-train.md
URL: https://github.com/apache/singa-doc/pull/14#issuecomment-608993432
 
 
   > The [DIST](https://github.com/apache/singa/blob/master/examples/autograd/mnist_cnn.py#L153) variable can be inferred based on the number of GPUs?
   
   For MPI, I do not give the number of GPUs (see the answer to the next question), so the DIST variable cannot be inferred in the MPI case.
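   
   To illustrate (a hypothetical sketch only, not the actual code at mnist_cnn.py#L153, and the flag names are illustrative): a multiprocessing launcher already knows how many workers it will spawn, so a distributed flag could be derived from that count, whereas under MPI every rank runs the same script and only learns the world size after the communicator is created, so the flag has to be given explicitly.
   
   ```python
   # Hypothetical sketch only; --gpus and --dist are illustrative names.
   import argparse
   
   parser = argparse.ArgumentParser()
   parser.add_argument("--gpus", type=int, default=1)   # known to the multiprocessing launcher
   parser.add_argument("--dist", action="store_true")   # must be set explicitly for an MPI run
   args = parser.parse_args()
   
   # Multiprocessing: the parent process spawns args.gpus workers, so the flag can be inferred.
   # MPI: each rank sees the same command line and no GPU count, so --dist must be passed.
   DIST = args.dist or args.gpus > 1
   ```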
   
   > For MPI, you do not need to pass `num_gpus` explicitly to `DistOpt`, but for multiprocessing you do?
   
   For MPI, we do not need to pass num_gpus, because this information is obtained from MPI:
   
   https://github.com/apache/singa/blob/dev/src/io/communicator.cc#L81
   `MPI_Comm_size(MPI_COMM_WORLD, &totalMPIRanksInGlobal)`
   However, multiprocessing does not have this information, so we need to pass num_gpus to let the communicator know.
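   
   As a Python analogue (using mpi4py for illustration only, not SINGA's C++ communicator), the MPI runtime can be queried for the total number of ranks, while a multiprocessing launcher has no such query and must be told the GPU count:
   
   ```python
   # Illustrative sketch: mpi4py's Get_size() mirrors what MPI_Comm_size does in communicator.cc.
   from mpi4py import MPI
   
   world_size = MPI.COMM_WORLD.Get_size()   # under MPI, the runtime itself knows the total ranks
   print("total ranks:", world_size)
   
   # With Python multiprocessing there is no equivalent query, so the user has to pass
   # the number of GPUs (e.g. num_gpus = 4) and forward it to the communicator.
   ```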
   
   > 
   > The format of the docstring is very good!
   > Some arguments may need more explanation:
   > 
   > 1. [nccl_id](https://github.com/apache/singa/blob/master/python/singa/opt.py#L191) is compulsory for multiprocessing, and should be None for MPI?
   > 2. How about num_gpu and gpu_per_node?
   > 3. Give a concrete example for `rank_in_local` and `rank_in_global`.
   
   Yes, I will explain them in the docs.
   1. nccl_id is compulsory for the initialization of the NCCL communicator in both MPI and multiprocessing in our code; here is the place that needs the id:
   https://github.com/apache/singa/blob/dev/src/io/communicator.cc#L108
   `ncclCommInitRank(&comm, totalMPIRanksInGlobal, id, MPIRankInGlobal)`
   2. num_gpu and gpu_per_node are required by multiprocessing, but MPI does not need them because they are provided by the MPI function MPI_Comm_size:
   https://github.com/apache/singa/blob/dev/src/io/communicator.cc#L81
   3. rank_in_local is the rank within the same node (it tells which GPU the process/script is using), while rank_in_global is the rank across all the nodes (the log is written at rank 0 only, and it tells which part of the dataset to take); a concrete numeric sketch is given right after this list.
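   
   As a concrete numeric sketch (my own example numbers, not taken from the SINGA code, and assuming ranks are assigned node by node): with 2 nodes and 4 GPUs per node there are 8 processes in total.
   
   ```python
   # Concrete example: 2 nodes x 4 GPUs per node = 8 processes.
   gpu_per_node = 4
   nodes = 2
   
   for node_id in range(nodes):
       for local in range(gpu_per_node):
           rank_in_local = local                             # which GPU on this node the process uses
           rank_in_global = node_id * gpu_per_node + local   # which slice of the dataset it takes
           print(f"node {node_id}, GPU {local}: "
                 f"rank_in_local={rank_in_local}, rank_in_global={rank_in_global}")
   # Only the process with rank_in_global == 0 writes the training log.
   ```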
   
   > In addition, we may need to introduce the implementation of the distributed training code in SINGA at the end of this documentation. We have given the overview of the synchronous training algorithm at the beginning of this documentation, but what is done on the Python side and the C++ side, and when the NCCL and MPI APIs are called, is not covered. This part is mainly for developers (not for end users).
   
   Got it, thanks. I will explain the implementation.

