singa-dev mailing list archives

From "wangwei (JIRA)" <>
Subject [jira] [Commented] (SINGA-132) Optimize training on a single node with GPUs
Date Sun, 13 Mar 2016 08:03:33 GMT


wangwei commented on SINGA-132:

SINGA-126 will resolve the first case.
For the second case, if these workers are launched in different processes, then we need Zookeeper
to coordinate them (e.g., for stopping). If all workers are in the same process, we do not
need Zookeeper. Instead, we can use the Stub thread to monitor the number of alive workers and
send a message to the servers once all workers have finished.

> Optimize training on a single node with GPUs
> --------------------------------------------
>                 Key: SINGA-132
>                 URL:
>             Project: Singa
>          Issue Type: Improvement
>            Reporter: wangwei
>            Assignee: wangwei
> There are two training situations. 
> 1. a single worker. For this case, there is no need to launch a separate server thread,
because it would only add communication cost between the worker and server. Instead, we can
create an Updater inside the Worker and call it to update the parameters locally inside the
Worker. The driver's workflow should be changed for this case, i.e., there is no need for
a stub thread or server thread. The worker should run in the main thread, and the program
terminates once the worker finishes.
> 2. multiple workers. For this case, we need both workers and servers. First, we can make
Zookeeper an optional dependency, as it is only used for job ID generation and termination-condition
checking. If no job ID is available, we can always use the default job ID (0). Since
there is only one process, we do not need Zookeeper to know the status of workers in other
processes. Second, the communication between worker, stub, and server should be optimized, e.g.,
using GPU-Direct.

This message was sent by Atlassian JIRA
