singa-dev mailing list archives

From "wangwei (JIRA)" <>
Subject [jira] [Updated] (SINGA-132) Optimize training on a single node with GPUs
Date Sun, 13 Mar 2016 08:03:33 GMT


wangwei updated SINGA-132:
    Assignee: wangwei  (was: Haibo Chen)

> Optimize training on a single node with GPUs
> --------------------------------------------
>                 Key: SINGA-132
>                 URL:
>             Project: Singa
>          Issue Type: Improvement
>            Reporter: wangwei
>            Assignee: wangwei
> There are two training situations.
> 1. A single worker. In this case there is no need to launch a separate server thread,
> because doing so would add communication cost between the worker and the server. Instead,
> we can create an Updater inside the Worker and call it to update the parameters locally.
> The driver's workflow should change for this case: no stub thread or server thread is
> needed. The worker should run in the main thread, and the program terminates once the
> worker finishes.
> 2. Multiple workers. In this case we need both workers and servers. First, we can make
> zookeeper an optional dependency, as it is used for Job ID generation and the termination
> condition check. If no Job ID is available, we can always use the default Job ID (0). Since
> there is only one process, we don't need zookeeper to track the status of workers in other
> processes. Second, the communication along the worker-stub-server path should be optimized,
> e.g., using GPU-Direct.
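
The two cases above can be illustrated with a small sketch. This is not SINGA code: the names `Updater`, `run_single_worker`, `run_with_server`, `train`, and `DEFAULT_JOB_ID` are hypothetical, the gradient is a constant toy value, and a Python queue stands in for the worker-to-server channel. It only shows the proposed control flow: one worker updates parameters locally in the main thread, while multiple workers send gradients to a server thread, and the Job ID falls back to 0 when none is supplied.

```python
import queue
import threading

DEFAULT_JOB_ID = 0  # assumed fallback when no Job ID service (e.g. zookeeper) is used


class Updater:
    """Plain SGD step; a stand-in for whatever update rule is configured."""

    def __init__(self, lr):
        self.lr = lr

    def update(self, params, grads):
        for name, grad in grads.items():
            params[name] -= self.lr * grad


def compute_grads(params):
    # Toy gradient: a constant 1.0 per parameter, so results do not depend
    # on the order in which asynchronous updates are applied.
    return {name: 1.0 for name in params}


def run_single_worker(params, steps, lr):
    """Case 1: the worker owns an Updater and runs in the calling thread."""
    updater = Updater(lr)
    for _ in range(steps):
        updater.update(params, compute_grads(params))
    return params


def run_with_server(params, steps, lr, num_workers):
    """Case 2: workers push gradients to a server thread that applies them."""
    updater = Updater(lr)
    inbox = queue.Queue()

    def server():
        while True:
            grads = inbox.get()
            if grads is None:  # shutdown sentinel sent after all workers finish
                return
            updater.update(params, grads)

    def worker():
        for _ in range(steps):
            inbox.put(compute_grads(params))

    server_thread = threading.Thread(target=server)
    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    server_thread.start()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    inbox.put(None)
    server_thread.join()
    return params


def train(num_workers, steps=4, lr=0.5, job_id=None):
    """Driver: pick the local-update path when there is exactly one worker."""
    job_id = DEFAULT_JOB_ID if job_id is None else job_id
    params = {"w": 10.0}
    if num_workers == 1:
        return job_id, run_single_worker(params, steps, lr)
    return job_id, run_with_server(params, steps, lr, num_workers)
```

In the single-worker path no extra thread or queue is created, which is the saving the issue describes. The multi-worker path keeps the queue hop that the issue suggests optimizing.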

