singa-dev mailing list archives

From "wangwei (JIRA)" <>
Subject [jira] [Created] (SINGA-132) Optimize training on a single node with GPUs
Date Tue, 12 Jan 2016 03:40:39 GMT
wangwei created SINGA-132:

             Summary: Optimize training on a single node with GPUs
                 Key: SINGA-132
             Project: Singa
          Issue Type: Improvement
            Reporter: wangwei
            Assignee: Haibo Chen

There are two training situations.
1. A single worker. In this case, there is no need to launch a separate server thread, because
it would add communication cost between the worker and the server. Instead, we can create
an Updater inside the Worker and call it to update the parameters locally.
The driver's workflow should be changed for this case, i.e., there is no need for
a stub thread or a server thread. The worker should run in the main thread, and the program
terminates once the worker finishes.
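
A minimal sketch of the single-worker flow, written in Python for brevity. Every name in it (SGDUpdater, Worker, run_single_worker) is hypothetical and does not reflect SINGA's actual classes; the toy objective and learning rate are assumptions chosen only to make the example runnable. It shows the worker owning its updater and applying updates through a direct call, with no stub or server thread involved.

```python
from dataclasses import dataclass, field


@dataclass
class SGDUpdater:
    """Hypothetical updater: plain SGD applied in place."""
    learning_rate: float = 0.1

    def update(self, params, grads):
        for name, grad in grads.items():
            params[name] -= self.learning_rate * grad


@dataclass
class Worker:
    """Hypothetical worker that owns its updater instead of messaging a server."""
    params: dict
    updater: SGDUpdater = field(default_factory=SGDUpdater)

    def compute_gradients(self):
        # Toy objective (an assumption): f(w) = 0.5 * w^2, so the gradient is w.
        return {name: value for name, value in self.params.items()}

    def run(self, steps):
        for _ in range(steps):
            grads = self.compute_gradients()
            # Local call: no stub thread, no server thread, no message passing.
            self.updater.update(self.params, grads)
        return self.params


def run_single_worker(steps=3):
    # The worker runs in the calling (main) thread; returning ends the job.
    worker = Worker(params={"w": 1.0})
    return worker.run(steps)
```

Because the update is an ordinary function call, the parameters never leave the worker, which is the communication saving described above.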

2. Multiple workers. In this case, we need both workers and servers. First, we can make ZooKeeper
an optional dependency, as it is used for Job ID generation and the termination condition
check. If no Job ID is available, we can always use the default Job ID (0). Since there is
only one process, we don't need ZooKeeper to know the status of workers in other processes.
Second, the communication along the worker-stub-server path should be optimized, e.g., using GPU-Direct.
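
The first point above (making ZooKeeper optional) could look roughly like the following Python sketch. The names resolve_job_id, run_workers_in_process, and the coordinator object with a generate_job_id() method are hypothetical placeholders, not SINGA or ZooKeeper APIs. The sketch assumes all workers are threads in one process, so joining the threads serves as the termination check.

```python
import threading

DEFAULT_JOB_ID = 0  # the default Job ID mentioned above


def resolve_job_id(coordinator=None):
    """Use the coordinator's Job ID if one is configured, else the default."""
    if coordinator is None:
        return DEFAULT_JOB_ID
    return coordinator.generate_job_id()


def run_workers_in_process(num_workers, work):
    """Run workers as threads in this process and wait for all of them.

    With a single process, join() is enough to detect termination, so no
    external service is needed to track worker status.
    """
    results = [None] * num_workers

    def target(worker_id):
        results[worker_id] = work(worker_id)

    threads = [
        threading.Thread(target=target, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The GPU-Direct optimization in the second point depends on hardware and driver support and is not sketched here.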
