From: GitBox
To: dev@singa.apache.org
Subject: [GitHub] [singa-doc] chrishkchris commented on a change in pull request #16: Add more details in the explanation of dist-train.md
Reply-To: dev@singa.apache.org
Date: Mon, 06 Apr 2020 03:50:44 -0000

chrishkchris commented on a change in pull request #16: Add more details in the explanation of dist-train.md
URL: https://github.com/apache/singa-doc/pull/16#discussion_r403812104

##########
File path: docs-site/docs/dist-train.md
##########

@@ -145,30 +165,80 @@ if __name__ == '__main__':
     nccl_id = singa.NcclIdHolder()

     # Define the number of GPUs to be used in the training process
-    gpu_per_node = int(sys.argv[1])
-    gpu_num = 1
+    num_gpus = int(sys.argv[1])

     # Define and launch the multi-processing
     import multiprocessing
     process = []
-    for gpu_num in range(0, gpu_per_node):
+    for gpu_num in range(0, num_gpus):
         process.append(multiprocessing.Process(target=train_mnist_cnn,
-                    args=(nccl_id, gpu_num, gpu_per_node)))
+                    args=(nccl_id, gpu_num, num_gpus)))

     for p in process:
         p.start()
 ```
+
+Here are some explanations concerning the variables created above:
+
+(i) `nccl_id`
+
+Note that we need to generate an NCCL ID here to be used for collective
+communication, and then pass it to all the processes. The NCCL ID is like a
+ticket: only the processes holding this ID can join the AllReduce operation.
+(Later, if we use MPI, passing the NCCL ID is not necessary, because the ID
+is broadcast by MPI in our code automatically.)
+
+(ii) `num_gpus`

Review comment:
   Yes. Can we use it in every location in the Python and C++ code? `num_gpus` will be changed to `world_size`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services
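
To make the launch pattern in the diff concrete, here is a minimal, runnable sketch using only the Python standard library. `train_worker` and the `uuid`-based ID are hypothetical stand-ins for SINGA's `train_mnist_cnn` and `singa.NcclIdHolder()`; this is an illustration of the pattern, not the SINGA implementation.

```python
# Sketch of the one-process-per-GPU launch pattern from the diff above.
# A single shared ID is generated once in the parent and passed to every
# worker, mirroring how the NCCL ID acts as a "ticket" for the AllReduce group.
import multiprocessing
import sys
import uuid


def train_worker(shared_id, rank, world_size):
    # Placeholder for train_mnist_cnn(nccl_id, gpu_num, num_gpus): every
    # process that holds the same shared_id could join the same group.
    print(f"rank {rank}/{world_size} joined group {shared_id}")


if __name__ == '__main__':
    # Stand-in for singa.NcclIdHolder(): created once, shared with all workers.
    shared_id = uuid.uuid4().hex
    world_size = int(sys.argv[1]) if len(sys.argv) > 1 else 2

    processes = []
    for rank in range(world_size):
        p = multiprocessing.Process(target=train_worker,
                                    args=(shared_id, rank, world_size))
        p.start()
        processes.append(p)

    # Unlike the snippet in the diff, the parent also joins the workers,
    # so it waits for all of them to finish before exiting.
    for p in processes:
        p.join()
```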
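
And a hedged sketch of the MPI alternative mentioned in explanation (i), where the NCCL ID does not need to be passed manually because rank 0 can broadcast it. This uses `mpi4py`, which is not part of the SINGA code quoted above, and a placeholder string in place of a real NCCL ID.

```python
# Broadcasting a shared ID with MPI, as explanation (i) describes:
# only rank 0 creates the ID, and bcast delivers it to every other rank.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
world_size = comm.Get_size()

# Placeholder for a real NCCL ID; only rank 0 has a value before the bcast.
nccl_id = "placeholder-nccl-id" if rank == 0 else None
nccl_id = comm.bcast(nccl_id, root=0)

print(f"rank {rank}/{world_size} received {nccl_id}")
```

Run with, e.g., `mpirun -np 4 python bcast_id.py`; every rank prints the same ID, which is why no manual passing is needed under MPI.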