singa-dev mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SINGA-32) Implement AllReduce training framework
Date Sat, 18 Jul 2015 08:39:04 GMT

    [ https://issues.apache.org/jira/browse/SINGA-32?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632337#comment-14632337 ]

ASF subversion and git services commented on SINGA-32:
------------------------------------------------------

Commit 96bedb2264f7d4ebd8a2a0cad67dc9a91f5419c9 in incubator-singa's branch refs/heads/master
from wang wei
[ https://git-wip-us.apache.org/repos/asf?p=incubator-singa.git;h=96bedb2 ]

SINGA-32 Implement synchronous training framework

Fix a bug in InitLocalParam() of the Worker class.
A worker owns a Param if the Param's data blob is not shared from other workers.
Previously, a Worker would not send a Get request for a Param that it owns.
However, it may not have initialized that Param locally, because its group is not the
first group in a group set that subscribes to the same server group.

To fix the bug, all workers now send Get requests for the Params in their local layers.
There is no extra cost for getting Params owned by the worker (from the first group),
because the Get request is not actually sent (the Param version is already the latest).
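
For illustration, the fixed behaviour can be sketched roughly as below. This is only a
sketch: the names (Worker, Param, SendGetRequest, the version fields) are hypothetical
placeholders, not the actual SINGA classes.

    // Hypothetical sketch of the fix; names do not match SINGA's real classes.
    #include <vector>

    struct Param {
      int version = -1;         // version of the locally held data
      int latest_version = 0;   // latest version published by the server group
    };

    struct Worker {
      std::vector<Param*> local_params;

      // After the fix, every worker issues Get for all Params of its local
      // layers; the request is skipped when the local copy is already at the
      // latest version, so owners from the first group pay no extra cost.
      void GetAllLocalParams() {
        for (Param* p : local_params) {
          if (p->version < p->latest_version) {
            SendGetRequest(p);   // fetch the Param data from the server group
          }
          // otherwise the local copy is up to date and no request is sent
        }
      }

      void SendGetRequest(Param* /*p*/) {
        // placeholder for the real messaging code
      }
    };

    int main() {
      Param p;                      // a Param whose local copy is stale
      Worker w;
      w.local_params.push_back(&p);
      w.GetAllLocalParams();        // issues exactly one Get for p
      return 0;
    }

The version check is what makes the fix cheap for owners: their local copy is already at
the latest version, so the loop never issues a network request on their behalf.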


> Implement AllReduce training framework
> --------------------------------------
>
>                 Key: SINGA-32
>                 URL: https://issues.apache.org/jira/browse/SINGA-32
>             Project: Singa
>          Issue Type: New Feature
>            Reporter: wangwei
>            Assignee: wangwei
>
> The AllReduce training framework runs in synchronous mode, where one worker starts the
> next iteration only after all workers have finished the previous iteration. Baidu's deepimage
> system uses this training framework.
> To implement it in SINGA, we launch one worker group and one server group. The model
> is partitioned (e.g., on dimension 0) among all workers. Params are sliced and partitioned
> among all servers.
> At the beginning, each Param (slice) is put into the server shard, together with the number
> of workers computing gradients for it.
> In each iteration, the local stub aggregates all gradients for the same Param and sends them
> to the corresponding server, along with the number of local workers that computed gradients
> for it. The server buffers the update requests and does not update a Param slice until it has
> received gradients from all workers. It then sends the updated Param (slices) back to the
> corresponding process (stub).
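
As a rough picture of the server-side buffering described above, here is a small C++
sketch; SliceBuffer, UpdateRequest and the field names are assumptions for illustration,
not SINGA's actual Server code. The total worker count is assumed to come from the
initial Put of the Param slice.

    // Hypothetical sketch of server-side gradient buffering; names are
    // illustrative and do not match SINGA's real Server implementation.
    #include <unordered_map>
    #include <vector>

    struct UpdateRequest {
      int slice_id;                // which Param slice the gradient is for
      std::vector<float> grad;     // gradient aggregated by one local stub
      int num_local_workers;       // workers behind this stub that contributed
    };

    class SliceBuffer {
     public:
      // total_workers is assumed to be recorded from the Put request that
      // inserted the Param slice into the server shard at startup.
      explicit SliceBuffer(int total_workers) : total_workers_(total_workers) {}

      // Accumulate one request; returns true (and fills *sum) only once
      // gradients from all workers have arrived, i.e. when the server may
      // update the Param slice and send it back to the stubs.
      bool Add(const UpdateRequest& req, std::vector<float>* sum) {
        Entry& e = pending_[req.slice_id];
        if (e.sum.empty()) e.sum.assign(req.grad.size(), 0.0f);
        for (size_t i = 0; i < req.grad.size(); ++i) e.sum[i] += req.grad[i];
        e.received += req.num_local_workers;
        if (e.received < total_workers_) return false;  // keep buffering
        *sum = e.sum;                                    // all workers heard from
        pending_.erase(req.slice_id);
        return true;
      }

     private:
      struct Entry { std::vector<float> sum; int received = 0; };
      std::unordered_map<int, Entry> pending_;
      int total_workers_;
    };

    int main() {
      SliceBuffer buf(4);                       // e.g. 4 workers in total
      std::vector<float> sum;
      UpdateRequest req{0, {1.f, 2.f}, 2};      // one stub covering 2 workers
      buf.Add(req, &sum);                       // still buffering
      bool ready = buf.Add(req, &sum);          // now all 4 accounted for
      return ready ? 0 : 1;
    }

Buffering by worker count rather than by time is what makes the framework synchronous:
no slice is updated, and no worker can proceed to the next iteration, until every
worker's gradient has been accounted for.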



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
