singa-dev mailing list archives

From "wangwei (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (SINGA-226) Add parallel training on a single machine for singa v1.0
Date Thu, 06 Oct 2016 11:55:22 GMT

     [ https://issues.apache.org/jira/browse/SINGA-226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangwei closed SINGA-226.
-------------------------
    Resolution: Fixed

> Add parallel training on a single machine for singa v1.0
> --------------------------------------------------------
>
>                 Key: SINGA-226
>                 URL: https://issues.apache.org/jira/browse/SINGA-226
>             Project: Singa
>          Issue Type: New Feature
>            Reporter: Wang Ji
>            Assignee: Wang Ji
>
> In this ticket, we implement parallel training using multiple devices on a single machine.

> To support parallel training, an Updater class needs to be implemented to aggregate the partial
> gradients from the parallel workers and use the Optimizer to update the Parameters. The Updater
> can be designed for different kinds of topological structures, i.e., *local-cpu*, *local-gpu*, *local-allreduce*.

> *local-cpu:* Aggregate and update the parameters on the CPU. In this mode, the host CPU needs
> to copy the gradient and parameter tensors from the GPU workers, perform the update, and copy the results back.
> *local-gpu:* Aggregate and update the parameters on a chosen GPU. In this mode, the updater
> GPU needs to copy the gradient and parameter tensors from the other GPU workers, perform the update,
> and copy the results back.
> *local-allreduce:* In this mode, each parameter is sliced among all GPU workers. In each
> iteration, the gradients are aggregated and the parameters updated in an MPI Allreduce style.
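The three modes above differ mainly in where the aggregation happens and whether the parameters are sliced. The following is a rough NumPy sketch, not SINGA's actual Updater/Optimizer API; the function and variable names are hypothetical and only meant to contrast the centralized (local-cpu / local-gpu) style with the sliced local-allreduce style.

import numpy as np

def aggregate_centralized(worker_grads, param, lr):
    """local-cpu / local-gpu style: one device gathers all partial gradients,
    applies the update once, and copies the fresh parameter back to every worker."""
    total = sum(worker_grads)                      # copy-in and sum on the updater device
    param -= lr * (total / len(worker_grads))      # single SGD-style update (averaged here for simplicity)
    return [param.copy() for _ in worker_grads]    # copy the updated parameter back to each worker

def aggregate_allreduce(worker_grads, worker_params, lr):
    """local-allreduce style: each worker owns one slice of the parameter,
    reduces the gradients for that slice, updates it, then the slices are gathered."""
    n = len(worker_grads)
    slices = [np.array_split(g, n) for g in worker_grads]    # slice every worker's gradient
    updated = []
    for k in range(n):                                        # worker k reduces and updates slice k
        reduced = sum(s[k] for s in slices) / n
        new_slice = np.array_split(worker_params[k], n)[k] - lr * reduced
        updated.append(new_slice)
    full = np.concatenate(updated)                            # all-gather the updated slices
    return [full.copy() for _ in range(n)]

In the centralized variant the updater device does all the work and the others wait; in the allreduce variant the update cost is spread across workers at the price of slicing and gathering, which is the trade-off the ticket describes.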



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
