singa-dev mailing list archives

From GitBox <>
Subject [GitHub] [singa-doc] chrishkchris edited a comment on issue #14: rearrange contents in
Date Sat, 04 Apr 2020 09:53:05 GMT
   > Is nccl_id passed to `train_mnist_cnn` for training with MPI?
   No. In the MPI case, nccl_id is generated on rank 0 and immediately broadcast by MPI to every rank:
     if (MPIRankInGlobal == 0) ncclGetUniqueId(&id);
     MPICHECK(MPI_Bcast((void *)&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD));
   However, for Python multiprocessing, what I can do is generate the nccl_id once at the beginning and pass it to the multiprocessing target function.
   The nccl_id is like a ticket: only the processes holding the same ticket can join the same allreduce.
   > For multiprocessing, num_gpu = gpu_per_node, hence we only need num_gpu?
   num_gpu is the local rank of a specific process, while gpu_per_node is the total number of ranks on a single node:
     for gpu_num in range(0, gpu_per_node):
         process.append(multiprocessing.Process(target=train_mnist_cnn,
             args=(sgd, max_epoch, batch_size, True, data_partition,
                   gpu_num, gpu_per_node, nccl_id)))

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:

With regards,
Apache Git Services
