From: GitBox
To: dev@singa.apache.org
Subject: [GitHub] [singa] chrishkchris opened a new pull request #566: SINGA-487 Add Sparsification Algorithms
Date: Wed, 18 Dec 2019 10:16:31 -0000

chrishkchris opened a new pull request #566: SINGA-487 Add Sparsification Algorithms
URL: https://github.com/apache/singa/pull/566

This PR implements sparsification schemes that transmit only the significant gradient elements. Because we use the CUDA Thrust parallel algorithms to convert the dense array into a sparse array, the overhead is relatively low.

It supports two modes, controlled by the flag topK:
1. When topK is False, it transmits the gradient elements that exceed an absolute threshold value.
2. When topK is True, it transmits the K largest gradient elements, where K equals the total number of elements multiplied by the sparsification factor.

Moreover, there is a flag corr that uses the locally accumulated gradient for correction. The flag is True by default, because local gradient accumulation is commonly used for correction in sparsification. A minimal sketch of the two modes and the correction is given after the references below.

Some reference papers for the sparsification:

[1] N. Strom. Scalable distributed DNN training using commodity GPU cloud computing. In Proceedings of InterSpeech 2015. International Speech Communication Association (ISCA), September 2015.

[2] A. F. Aji and K. Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pages 440-445. Association for Computational Linguistics (ACL), September 2017.
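To make the two modes and the corr flag concrete, here is a minimal NumPy sketch of the selection logic. This is not the PR's implementation (which performs the dense-to-sparse conversion on the GPU with CUDA Thrust); it assumes corr means the usual error-feedback residual accumulation of [1] and [2], and the function name sparsify and its exact signature are illustrative only.

```
import numpy as np

def sparsify(grad, residual, threshold=0.05, spars=0.05, topK=False, corr=True):
    """Illustrative dense-to-sparse gradient selection (CPU sketch only).

    grad     : dense gradient array
    residual : locally accumulated gradient left over from previous steps
    """
    # corr: add back the locally accumulated gradient before selecting
    # (assumed here to be the error-feedback correction of [1] and [2]).
    if corr:
        grad = grad + residual

    flat = grad.ravel()
    if topK:
        # topK=True: transmit the K largest-magnitude elements, where
        # K = total number of elements * sparsification factor.
        k = max(1, int(flat.size * spars))
        idx = np.argpartition(np.abs(flat), -k)[-k:]
    else:
        # topK=False: transmit elements exceeding an absolute threshold.
        idx = np.nonzero(np.abs(flat) > threshold)[0]

    values = flat[idx]                  # the sparse payload to transmit
    new_residual = flat.copy()
    new_residual[idx] = 0.0             # transmitted elements leave the residual
    return idx, values, new_residual.reshape(grad.shape)

# Toy usage: the residual carries over between iterations.
rng = np.random.default_rng(0)
residual = np.zeros(10000)
for step in range(3):
    grad = rng.normal(scale=0.1, size=10000)   # stand-in for a real gradient
    idx, values, residual = sparsify(grad, residual, spars=0.01, topK=True)
    print(f"step {step}: transmitted {values.size} of {grad.size} elements")
```

In the PR itself the same selection runs on the GPU via Thrust's parallel primitives, which is why the conversion overhead stays low; the sketch only illustrates the selection and correction logic.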
I have added an example file sparsification_mnist.py to test the accuracy. The following results are based on an AWS p2.8xlarge instance with 8 K80 GPUs.

```
ubuntu@ip-172-31-18-216:~/singa/examples/autograd$ python3 sparsification_mnist.py
Starting Epoch 0:
Training loss = 1237.824951, training accuracy = 0.537627
Evaluation accuracy = 0.831209, Elapsed Time = 1.364238s
Starting Epoch 1:
Training loss = 468.859161, training accuracy = 0.835053
Evaluation accuracy = 0.931229, Elapsed Time = 0.687484s
Starting Epoch 2:
Training loss = 329.488220, training accuracy = 0.887604
Evaluation accuracy = 0.949424, Elapsed Time = 0.713595s
Starting Epoch 3:
Training loss = 220.463303, training accuracy = 0.925731
Evaluation accuracy = 0.955592, Elapsed Time = 0.686450s
Starting Epoch 4:
Training loss = 171.178146, training accuracy = 0.942141
Evaluation accuracy = 0.961760, Elapsed Time = 0.686534s
Starting Epoch 5:
Training loss = 149.635681, training accuracy = 0.950237
Evaluation accuracy = 0.974198, Elapsed Time = 0.686791s
Starting Epoch 6:
Training loss = 124.092453, training accuracy = 0.958300
Evaluation accuracy = 0.973376, Elapsed Time = 0.686136s
Starting Epoch 7:
Training loss = 115.288582, training accuracy = 0.961205
Evaluation accuracy = 0.968647, Elapsed Time = 0.686174s
Starting Epoch 8:
Training loss = 99.048584, training accuracy = 0.966864
Evaluation accuracy = 0.981188, Elapsed Time = 0.685848s
Starting Epoch 9:
Training loss = 84.038574, training accuracy = 0.972239
Evaluation accuracy = 0.981188, Elapsed Time = 0.685568s
```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services