singa-dev mailing list archives

From "Sheng Wang (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (SINGA-119) Remove job registration before launching the training program
Date Tue, 13 Dec 2016 07:10:58 GMT

     [ https://issues.apache.org/jira/browse/SINGA-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sheng Wang closed SINGA-119.
----------------------------
    Resolution: Invalid

> Remove job registration before launching the training program
> -------------------------------------------------------------
>
>                 Key: SINGA-119
>                 URL: https://issues.apache.org/jira/browse/SINGA-119
>             Project: Singa
>          Issue Type: New Feature
>            Reporter: wangwei
>            Assignee: Sheng Wang
>
> Job registration, including getting the job ID, is necessary for training in a cluster.
> It is done in the `bin/singa-run.sh` script, before ssh-ing to each node to invoke the
> training program.
> In some situations, e.g., a small model or a single node (with multiple GPU cards), users
> do not need to train the model on multiple nodes. Many models can be trained on a single
> node (process) with multiple GPU cards. In this case, it would be better to remove the job
> registration step to make job launching simpler. For instance, users could start training by
> {code}
> ./singa -conf examples/cifar10/job.conf
> {code}
> or via a Python script (see SINGA-81)
> {code}
> python tool/python/examples/cifar10.py
> {code}
> The job ID is determined inside the program by cluster_rt.cc, which communicates with the
> zookeeper server. We may later make zookeeper an optional dependency for training on a
> single node, as it is mainly used for generating a unique job ID.
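> For illustration, a minimal sketch of generating the job ID locally when zookeeper is
> skipped (the helper name below is hypothetical, not taken from cluster_rt.cc):
> {code}
> #include <cstdint>
> #include <cstdio>
> #include <chrono>
> #include <random>
>
> // Hypothetical helper, not part of cluster_rt.cc: combine a timestamp with a
> // random component so repeated single-node runs are unlikely to collide.
> int32_t GenerateLocalJobID() {
>   using namespace std::chrono;
>   auto secs = duration_cast<seconds>(
>       system_clock::now().time_since_epoch()).count();
>   std::random_device rd;
>   std::mt19937 gen(rd());
>   std::uniform_int_distribution<int32_t> dist(0, 0xFFFF);
>   return static_cast<int32_t>(((secs & 0x7FFF) << 16) | dist(gen));
> }
>
> int main() {
>   std::printf("job id: %d\n", GenerateLocalJobID());
>   return 0;
> }
> {code}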
> In the extreme case where there is a single worker, we do not need to create a server
> thread at all. Instead, we can create an Updater instance inside the worker, which updates
> the parameters locally. This would speed up training on a single GPU card, because the
> gradients and parameters would not need to be transferred between the worker and the
> server. Currently, the gradients have to be transferred from the worker (GPU memory) to
> the server (CPU memory), which is time-consuming. A rough sketch of this in-worker update
> path follows below.
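> A rough sketch of the in-worker update path (Param, Updater, and Update() below are
> illustrative names only, not SINGA's actual classes):
> {code}
> #include <cstdio>
> #include <vector>
>
> // Hypothetical types for illustration; in SINGA the data would live in GPU memory.
> struct Param {
>   std::vector<float> data;  // parameter values
>   std::vector<float> grad;  // gradients computed by the worker
> };
>
> struct Updater {
>   float lr = 0.01f;
>   // Plain SGD applied in place; on a GPU this would be a device kernel,
>   // avoiding any copy of gradients to CPU memory.
>   void Update(Param* p) {
>     for (size_t i = 0; i < p->data.size(); ++i)
>       p->data[i] -= lr * p->grad[i];
>   }
> };
>
> int main() {
>   Param w{{1.0f, 2.0f}, {0.5f, -0.5f}};
>   Updater updater;
>   // Inside the worker's training loop this would run right after backward().
>   updater.Update(&w);
>   std::printf("%.3f %.3f\n", w.data[0], w.data[1]);
>   return 0;
> }
> {code}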



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
