singa-dev mailing list archives

From "ASF subversion and git services (JIRA)" <>
Subject [jira] [Commented] (SINGA-32) Implement AllReduce training framework
Date Sat, 18 Jul 2015 08:39:04 GMT


ASF subversion and git services commented on SINGA-32:

Commit 585e275fdf050db25eb9c583fb54ae39714d9b20 in incubator-singa's branch refs/heads/master
from wang wei
[;h=585e275 ]

SINGA-32 Implement Synchronous training frameworks

For the synchronous training frameworks, one worker group and one server group are launched.
Gradients for the same Param are aggregated locally at each process's stub.
The server does not conduct the update for a Param (slice) until it has received all gradients
for it. After the update, the server sends the new Param (slice) values back to every process
that sent an update request.
The worker_shard_ and server_shard consist of ParamEntrys, each of which stores the
information of one unique Param (slice), e.g., the number of shares of that Param (slice)
and the local shares themselves.

The Msg class is improved to have a clean, simple API. The msg header now includes a src
(int), a dst (int) and a trgt (an int value and an int version), representing the source
address, destination address and target of the msg. The address is constructed by the
entity that creates the msg; any addr is valid as long as it is unique per entity.
The function Addr(int grp, int id_or_proc, int type) is provided to construct the addr from a
group ID, a worker/server ID (or procs ID) and an entity type (kServer, kStub, etc.).
Functions are also provided to extract the group ID and worker/server ID from an addr (int).
Similarly, the target field can be constructed with the ParamTrgt function, which wraps a
Param ID and a Slice ID into a target value (int); ParamID() and SliceID() extract them
from a target value.

> Implement AllReduce training framework
> --------------------------------------
>                 Key: SINGA-32
>                 URL:
>             Project: Singa
>          Issue Type: New Feature
>            Reporter: wangwei
>            Assignee: wangwei
> The AllReduce training framework runs in synchronous mode, where a worker starts the
> next iteration only after all workers have finished the previous iteration. Baidu's
> deepimage system uses this training framework.
> To implement it in SINGA, we launch one worker group and one server group. The model
> is partitioned (e.g., on dimension 0) among all workers. Params are sliced and partitioned
> among all servers.
> At the beginning, each Param (slice) is put into the server shard together with the number
> of workers computing gradients for it.
> For each iteration, the local stub aggregates all gradients for the same Param and sends
> them to the corresponding server, together with the number of local workers computing
> gradients for it. The server buffers update requests and does not conduct the update for a
> Param slice until it has received gradients from all workers. It then sends the updated
> Param (slices) back to the corresponding processes (stubs).

This message was sent by Atlassian JIRA
