singa-dev mailing list archives

From "ASF subversion and git services (JIRA)" <>
Subject [jira] [Commented] (SINGA-42) Issue when loading checkpoints
Date Mon, 03 Aug 2015 02:39:05 GMT


ASF subversion and git services commented on SINGA-42:

Commit a92a1c7786c2f65344f8f3ff1cfc4aa545724b09 in incubator-singa's branch refs/heads/master
from Wei Wang
[;h=a92a1c7 ]

SINGA-42 Issue when loading checkpoints

Change the default value of ModelProto::reset_param_version to true.
When it is true, every parameter's version is reset to ModelProto::step.

When resuming training from a checkpoint, Trainer::Resume() sets the field to false if users
have not set it, so parameter versions continue from the last checkpoint.
If users set it to true, all parameter versions are reset to ModelProto::step.

When a checkpoint is used as pre-training to initialize new model parameters,
users should keep the default value (i.e., true);
otherwise some parameters' versions would be much larger than others.
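
The rules above can be sketched in a few lines of C++. This is only an illustration, not
SINGA's code: ModelConf, ApplyResumeDefault and InitialVersion are hypothetical names, and
the has_reset_param_version flag merely imitates the has_*() accessor that protobuf
generates for an optional field.

```cpp
// Hypothetical stand-in for the relevant ModelProto fields (not SINGA's class).
struct ModelConf {
  int step = 0;                          // plays the role of ModelProto::step
  bool reset_param_version = true;       // new default introduced by this commit
  bool has_reset_param_version = false;  // true only if the user set the field
};

// Assumed behaviour of the resume path: keep the checkpoint versions
// unless the user explicitly asked for a reset.
void ApplyResumeDefault(ModelConf* conf) {
  if (!conf->has_reset_param_version) conf->reset_param_version = false;
}

// Version a parameter starts with after it is loaded from a checkpoint.
int InitialVersion(const ModelConf& conf, int checkpoint_version) {
  return conf.reset_param_version ? conf.step : checkpoint_version;
}
```

With the default configuration (pre-training case) a parameter saved at version 5000 starts
again at step 0; after ApplyResumeDefault() with step = 5000 it keeps version 5000.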

Fix a bug caused by calling Worker::Put() in Worker::InitLocalParam().

Previously, the version/step passed to Put() was step_, which starts from ModelProto::step.
Hence, the params on the servers were put with version step_.
When the params are loaded from checkpoint files and their versions are not reset,
training may get stuck after one iteration.
The reason is that if the start step is small (usually 0), the param version on the server
side is also small (i.e., 0), while local_version() (assigned from version()) is the one from
the last checkpoint, which is large. Hence the Worker::Collect() function gets stuck.
To fix this bug, pass the current param->version() to Worker::Put().
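
A toy model may make the mismatch easier to see. It assumes (this is not taken from the
SINGA source) that the server-side version grows by one per update and that the worker
only proceeds once the server version exceeds its local version. ToyServer and
UpdatesUntilCollect are hypothetical names.

```cpp
// Toy server that only tracks the version of a single param.
struct ToyServer {
  int version;
};

// Returns how many server updates are needed before the worker's wait
// condition (server version > local version) holds, or -1 if it does not
// hold within max_updates updates.
int UpdatesUntilCollect(int put_version, int local_version, int max_updates) {
  ToyServer server{put_version};
  for (int n = 1; n <= max_updates; ++n) {
    ++server.version;  // one update applied on the server
    if (server.version > local_version) return n;
  }
  return -1;
}
```

Putting with step_ = 0 while the local version is 5000 never satisfies the condition within
a handful of updates, whereas putting with the param's own version (5000) satisfies it after
the first update.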

Remove hard-coded check for label value in

Fix a bug in setting the checkpoint file path in Worker::Resume().
Worker::Resume() now clears the checkpoint field in JobProto and adds
the checkpoint files it finds under WORKSPACE/checkpoint/.
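
The clear-then-add behaviour could look roughly like the sketch below. JobConf and
ResetCheckpointPaths are hypothetical; the directory listing is passed in as a plain list
of file names so that the sketch does not guess how SINGA scans the folder or how it names
checkpoint files.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the repeated checkpoint field of JobProto.
struct JobConf {
  std::vector<std::string> checkpoint;
};

// Drops any stale checkpoint paths, then records every file found
// under <workspace>/checkpoint/.
void ResetCheckpointPaths(JobConf* conf, const std::string& workspace,
                          const std::vector<std::string>& found_files) {
  conf->checkpoint.clear();
  for (const std::string& name : found_files) {
    conf->checkpoint.push_back(workspace + "/checkpoint/" + name);
  }
}
```

Clearing first means that a path left over from an earlier configuration cannot be loaded
together with the files discovered on resume.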

> Issue when loading checkpoints 
> -------------------------------
>                 Key: SINGA-42
>                 URL:
>             Project: Singa
>          Issue Type: Bug
>            Reporter: ZHAOJING
> When I try loading checkpoints of 4 pretrained RBM models in order to train a deep autoencoder,
> the program gets stuck.
> The problem comes from resetting the version of params loaded from checkpoint files. After
> we change
> "optional bool reset_param_version = 67 [default = false];" to
> "optional bool reset_param_version = 67 [default = true];"
> in src/proto/job.conf, the problem is resolved.

This message was sent by Atlassian JIRA