singa-dev mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SINGA-12) Support Checkpoint and Restore
Date Tue, 28 Jul 2015 12:00:05 GMT

    [ https://issues.apache.org/jira/browse/SINGA-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644280#comment-14644280
] 

ASF subversion and git services commented on SINGA-12:
------------------------------------------------------

Commit 06163950bff355ce3c83764ab51f07ee95993e09 in incubator-singa's branch refs/heads/master
from Wei Wang
[ https://git-wip-us.apache.org/repos/asf?p=incubator-singa.git;h=0616395 ]

SINGA-12 Support Checkpoint and Restore

Fix a bug in the Resume function, which generated errors when irregular files were placed in
the checkpoint folder.
Irregular files are now reported and ignored.
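
The behaviour described in this commit message (report irregular files in the checkpoint folder, then skip them) can be sketched as follows. This is an illustration only, not the SINGA implementation: the function name `find_checkpoints` and the `step<N>-worker<M>.bin` naming pattern are assumptions made for the example.

```python
import logging
import os
import re

# Hypothetical naming scheme, assumed for this sketch only;
# the file names actually used by SINGA may differ.
CHECKPOINT_NAME = re.compile(r"^step(\d+)-worker(\d+)\.bin$")


def find_checkpoints(folder):
    """Return (step, worker, path) tuples for well-formed checkpoint files.

    Anything that is not a regular file, or whose name does not match the
    assumed pattern, is logged and skipped instead of raising an error.
    """
    found = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        match = CHECKPOINT_NAME.match(name)
        if match is None or not os.path.isfile(path):
            # Report the irregular entry, then ignore it.
            logging.warning("ignoring irregular file in checkpoint folder: %s", path)
            continue
        found.append((int(match.group(1)), int(match.group(2)), path))
    # Tuples sort by step first, so the last entry is the latest checkpoint.
    return sorted(found)
```

With this shape, a stray `notes.txt` or a subdirectory in the folder produces a warning rather than aborting the resume.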


> Support Checkpoint and Restore
> ------------------------------
>
>                 Key: SINGA-12
>                 URL: https://issues.apache.org/jira/browse/SINGA-12
>             Project: Singa
>          Issue Type: New Feature
>            Reporter: Sheng Wang
>            Assignee: Sheng Wang
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> With checkpoint support, we can provide the following features:
> 1. Failure Recovery: when a task fails during training, we can recover it from the latest checkpoint;
> 2. Continuous Training: when the user checks the trained model and finds that more steps are needed, he can continue the training;
> 3. Parameter Reuse: from a previously trained model, we can create a new model by adding new layers on top of it and reuse the parameters during training.
> Checkpoints should be taken on the server side every few steps. In addition, a final checkpoint will be made when the task finishes.
> During restore, the servers/workers are first set up as normal, and the parameters are then loaded from the checkpoint file.
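
The checkpoint schedule in the description (a checkpoint every few steps, a final checkpoint at the end, and a restore that loads parameters after normal setup) might look roughly like the sketch below. It is a single-process illustration under stated assumptions, not SINGA's server-side code: the function names, the `step<N>.ckpt` file name, the pickle format, and the stand-in parameter update are all hypothetical.

```python
import os
import pickle


def checkpoint_path(folder, step):
    # Hypothetical file name, assumed for this sketch.
    return os.path.join(folder, f"step{step}.ckpt")


def save_checkpoint(folder, step, params):
    """Write the checkpoint to a temporary file, then rename it into place."""
    os.makedirs(folder, exist_ok=True)
    tmp = checkpoint_path(folder, step) + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "params": params}, f)
    os.replace(tmp, checkpoint_path(folder, step))


def load_checkpoint(path):
    """Return (step, params) stored in a checkpoint file."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    return state["step"], state["params"]


def train(params, total_steps, folder, every=10, start_step=0):
    """Run steps start_step+1 .. total_steps, checkpointing along the way."""
    for step in range(start_step + 1, total_steps + 1):
        # Stand-in for a real parameter update.
        params = {name: value + 1.0 for name, value in params.items()}
        if step % every == 0:
            save_checkpoint(folder, step, params)
    # Final checkpoint once the task is finished.
    save_checkpoint(folder, total_steps, params)
    return params
```

Failure recovery and continuous training then reduce to the same call: load the latest checkpoint with `load_checkpoint` and pass its step and parameters back to `train` as `start_step` and `params`.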



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
