From: GitBox
To: dev@singa.apache.org
Subject: [GitHub] [singa-doc] chrishkchris commented on a change in pull request #16: Add more details in the explanation of dist-train.md
Reply-To: dev@singa.apache.org
Date: Mon, 06 Apr 2020 03:50:44 -0000

chrishkchris commented on a change in pull request #16: Add more details in the explanation of dist-train.md
URL: https://github.com/apache/singa-doc/pull/16#discussion_r403812104

##########
File path: docs-site/docs/dist-train.md
##########

@@ -145,30 +165,80 @@ if __name__ == '__main__':
     nccl_id = singa.NcclIdHolder()

     # Define the number of GPUs to be used in the training process
-    gpu_per_node = int(sys.argv[1])
-    gpu_num = 1
+    num_gpus = int(sys.argv[1])

     # Define and launch the multi-processing
     import multiprocessing
     process = []
-    for gpu_num in range(0, gpu_per_node):
+    for gpu_num in range(0, num_gpus):
         process.append(multiprocessing.Process(target=train_mnist_cnn,
-                    args=(nccl_id, gpu_num, gpu_per_node)))
+                    args=(nccl_id, gpu_num, num_gpus)))

     for p in process:
         p.start()
 ```
+
+Here are some explanations concerning the variables created above:
+
+(i) `nccl_id`
+
+Note that we need to generate an NCCL ID here to be used for collective
+communication, and then pass it to all the processes. The NCCL ID is like a
+ticket: only the processes holding this ID can join the AllReduce operation.
+(Later, if we use MPI, passing the NCCL ID is not necessary, because the ID
+is broadcast by MPI in our code automatically.)
+
+(ii) `num_gpus`

Review comment:
   Yes. Can we use it in every location in the Python and C++ code? `num_gpus` will be changed to `world_size`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services
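
To make the launch pattern in the diff concrete, here is a minimal, runnable sketch using only the Python standard library. `train_worker` and the `uuid`-based ID are hypothetical stand-ins for SINGA's `train_mnist_cnn` and `singa.NcclIdHolder()`; this is an illustration of the pattern, not the SINGA implementation.

```python
# Sketch of the one-process-per-GPU launch pattern from the diff above.
# A single shared ID is generated once in the parent and passed to every
# worker, mirroring how the NCCL ID acts as a "ticket" for the AllReduce group.
import multiprocessing
import sys
import uuid


def train_worker(shared_id, rank, world_size):
    # Placeholder for train_mnist_cnn(nccl_id, gpu_num, num_gpus): every
    # process that holds the same shared_id could join the same group.
    print(f"rank {rank}/{world_size} joined group {shared_id}")


if __name__ == '__main__':
    # Stand-in for singa.NcclIdHolder(): created once, shared with all workers.
    shared_id = uuid.uuid4().hex
    world_size = int(sys.argv[1]) if len(sys.argv) > 1 else 2

    processes = []
    for rank in range(world_size):
        p = multiprocessing.Process(target=train_worker,
                                    args=(shared_id, rank, world_size))
        p.start()
        processes.append(p)

    # Unlike the snippet in the diff, the parent also joins the workers,
    # so it waits for all of them to finish before exiting.
    for p in processes:
        p.join()
```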
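
And a hedged sketch of the MPI alternative mentioned in explanation (i), where the NCCL ID does not need to be passed manually because rank 0 can broadcast it. This uses `mpi4py`, which is not part of the SINGA code quoted above, and a placeholder string in place of a real NCCL ID.

```python
# Broadcasting a shared ID with MPI, as explanation (i) describes:
# only rank 0 creates the ID, and bcast delivers it to every other rank.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
world_size = comm.Get_size()

# Placeholder for a real NCCL ID; only rank 0 has a value before the bcast.
nccl_id = "placeholder-nccl-id" if rank == 0 else None
nccl_id = comm.bcast(nccl_id, root=0)

print(f"rank {rank}/{world_size} received {nccl_id}")
```

Run with, e.g., `mpirun -np 4 python bcast_id.py`; every rank prints the same ID, which is why no manual passing is needed under MPI.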