From: GitBox
To: dev@singa.apache.org
Subject: [GitHub] [singa] chrishkchris opened a new pull request #566: SINGA-487 Add Sparsification Algorithms
Date: Wed, 18 Dec 2019 10:16:31 -0000

chrishkchris opened a new pull request #566: SINGA-487 Add Sparsification Algorithms
URL: https://github.com/apache/singa/pull/566

This PR implements sparsification schemes that transmit only the significant gradient elements. Because we use the CUDA Thrust parallel algorithms to convert the dense array into a sparse array, the overhead is relatively low.

It supports two modes, controlled by the flag topK:
1. When topK is False, it transmits the gradient elements that exceed an absolute threshold value.
2. When topK is True, it transmits the K largest gradient elements, where K equals the total number of elements multiplied by the sparsification factor.

Moreover, there is a flag corr that uses the locally accumulated gradient for correction. The flag is True by default, because local gradient accumulation is commonly used for correction in sparsification. A minimal sketch of the two modes and the correction is given after the references below.

Some reference papers for the sparsification:

[1] N. Strom. Scalable distributed DNN training using commodity GPU cloud computing. In Proceedings of InterSpeech 2015. International Speech Communication Association (ISCA), September 2015.

[2] A. F. Aji and K. Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pages 440-445. Association for Computational Linguistics (ACL), September 2017.
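To make the two modes and the corr flag concrete, here is a minimal NumPy sketch of the selection logic. This is not the PR's implementation (which performs the dense-to-sparse conversion on the GPU with CUDA Thrust); it assumes corr means the usual error-feedback residual accumulation of [1] and [2], and the function name sparsify and its exact signature are illustrative only.

```
import numpy as np

def sparsify(grad, residual, threshold=0.05, spars=0.05, topK=False, corr=True):
    """Illustrative dense-to-sparse gradient selection (CPU sketch only).

    grad     : dense gradient array
    residual : locally accumulated gradient left over from previous steps
    """
    # corr: add back the locally accumulated gradient before selecting
    # (assumed here to be the error-feedback correction of [1] and [2]).
    if corr:
        grad = grad + residual

    flat = grad.ravel()
    if topK:
        # topK=True: transmit the K largest-magnitude elements, where
        # K = total number of elements * sparsification factor.
        k = max(1, int(flat.size * spars))
        idx = np.argpartition(np.abs(flat), -k)[-k:]
    else:
        # topK=False: transmit elements exceeding an absolute threshold.
        idx = np.nonzero(np.abs(flat) > threshold)[0]

    values = flat[idx]                  # the sparse payload to transmit
    new_residual = flat.copy()
    new_residual[idx] = 0.0             # transmitted elements leave the residual
    return idx, values, new_residual.reshape(grad.shape)

# Toy usage: the residual carries over between iterations.
rng = np.random.default_rng(0)
residual = np.zeros(10000)
for step in range(3):
    grad = rng.normal(scale=0.1, size=10000)   # stand-in for a real gradient
    idx, values, residual = sparsify(grad, residual, spars=0.01, topK=True)
    print(f"step {step}: transmitted {values.size} of {grad.size} elements")
```

In the PR itself the same selection runs on the GPU via Thrust's parallel primitives, which is why the conversion overhead stays low; the sketch only illustrates the selection and correction logic.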
I have added an example file sparsification_mnist.py to test the accuracy. The following results are based on an AWS p2.8xlarge instance with 8 K80 GPUs.

```
ubuntu@ip-172-31-18-216:~/singa/examples/autograd$ python3 sparsification_mnist.py
Starting Epoch 0:
Training loss = 1237.824951, training accuracy = 0.537627
Evaluation accuracy = 0.831209, Elapsed Time = 1.364238s
Starting Epoch 1:
Training loss = 468.859161, training accuracy = 0.835053
Evaluation accuracy = 0.931229, Elapsed Time = 0.687484s
Starting Epoch 2:
Training loss = 329.488220, training accuracy = 0.887604
Evaluation accuracy = 0.949424, Elapsed Time = 0.713595s
Starting Epoch 3:
Training loss = 220.463303, training accuracy = 0.925731
Evaluation accuracy = 0.955592, Elapsed Time = 0.686450s
Starting Epoch 4:
Training loss = 171.178146, training accuracy = 0.942141
Evaluation accuracy = 0.961760, Elapsed Time = 0.686534s
Starting Epoch 5:
Training loss = 149.635681, training accuracy = 0.950237
Evaluation accuracy = 0.974198, Elapsed Time = 0.686791s
Starting Epoch 6:
Training loss = 124.092453, training accuracy = 0.958300
Evaluation accuracy = 0.973376, Elapsed Time = 0.686136s
Starting Epoch 7:
Training loss = 115.288582, training accuracy = 0.961205
Evaluation accuracy = 0.968647, Elapsed Time = 0.686174s
Starting Epoch 8:
Training loss = 99.048584, training accuracy = 0.966864
Evaluation accuracy = 0.981188, Elapsed Time = 0.685848s
Starting Epoch 9:
Training loss = 84.038574, training accuracy = 0.972239
Evaluation accuracy = 0.981188, Elapsed Time = 0.685568s
```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services