singa-commits mailing list archives

From wang...@apache.org
Subject svn commit: r1692534 - in /incubator/singa/site/trunk/content: markdown/docs/checkpoint.md markdown/docs/data.md markdown/docs/examples.md site.xml
Date Fri, 24 Jul 2015 15:07:10 GMT
Author: wangwei
Date: Fri Jul 24 15:07:09 2015
New Revision: 1692534

URL: http://svn.apache.org/r1692534
Log:
Add docs for checkpoint

Added:
    incubator/singa/site/trunk/content/markdown/docs/checkpoint.md
Modified:
    incubator/singa/site/trunk/content/markdown/docs/data.md
    incubator/singa/site/trunk/content/markdown/docs/examples.md
    incubator/singa/site/trunk/content/site.xml

Added: incubator/singa/site/trunk/content/markdown/docs/checkpoint.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/checkpoint.md?rev=1692534&view=auto
==============================================================================
--- incubator/singa/site/trunk/content/markdown/docs/checkpoint.md (added)
+++ incubator/singa/site/trunk/content/markdown/docs/checkpoint.md Fri Jul 24 15:07:09 2015
@@ -0,0 +1,98 @@
+## Checkpoint and Resume
+
+___
+
+### Applications of checkpoint
+
+By taking checkpoints of model parameters, we can
+
+  1. Restore (resume) the training from the last checkpoint, e.g., when the
+    program crashes before finishing all training steps.
+
+  2. Use them as pre-training results for a similar model. For example, the
+    parameters from training an RBM model can be used to initialize
+    a [deep auto-encoder](auto-encoder.html) model.
+
+
+### Instructions for checkpoint and resume
+
+Checkpoint is controlled by two model configuration fields:
+`checkpoint_after` (start checkpointing after this number of training steps)
+and `checkpoint_frequency` (take a checkpoint every this many steps
+thereafter). The checkpoint files are located at
+`WORKSPACE/checkpoint/stepSTEP-workerWORKERID.bin`, where STEP and WORKERID
+are the training step and the worker ID.
+
+The following configuration shows an example,
+
+    model {
+      ...
+      checkpoint_after: 100
+      checkpoint_frequency: 300
+      ...
+    }
+
+After training for 700 steps, there would be two checkpoint files under the
+WORKSPACE/checkpoint folder (when training on a single node), taken at
+steps 400 (=100+300) and 700 (=400+300):
+
+    step400-worker0.bin
+    step700-worker0.bin
+
+#### Application 1
+We can resume the training from the last checkpoint (i.e., step 700) by:
+
+    ./bin/singa-run.sh -workspace=WORKSPACE -resume=true
+
+#### Application 2
+
+We can also use the checkpoint file from step 400 as a
+pre-trained model for a new model by configuring the
+new model's job.conf as:
+
+    model {
+      ...
+      checkpoint : WORKSPACE/checkpoint/step400-worker0.bin
+      ...
+    }
+
+If there are multiple checkpoint files for the same snapshot due to model
+partitioning, all the checkpoint files should be added:
+
+    model {
+      ...
+      checkpoint : WORKSPACE/checkpoint/step400-worker0.bin
+      checkpoint : WORKSPACE/checkpoint/step400-worker1.bin
+      ...
+    }
+
+
+The launching command is the same as that for starting a new job:
+
+    ./bin/singa-run.sh -workspace=WORKSPACE
+
+
+### Implementation details
+
+Checkpointing is done in the Worker class and controlled by two model
+configuration fields: `checkpoint_after` and `checkpoint_frequency`.
+Only Param objects from the first group that own the parameter values are
+dumped into checkpoint files. For each Param object, its name, version and
+values are saved. The snapshot may be separated into multiple files because
+the neural net is partitioned among multiple workers.
+
+The Worker's InitLocalParam will initialize Params from checkpoint files if
+the `checkpoint` field is set; otherwise it randomly initializes them using
+the user-configured initialization method. The Param objects are matched
+based on name. If a Param is not configured with a name, the NeuralNet class
+will automatically create one for it based on the name of the layer to which
+the Param object belongs. The `checkpoint` field can be set by users
+(Application 2) or by the Resume function of the Trainer class
+(Application 1), which finds the files for the latest snapshot and adds them
+to the `checkpoint` field. It also sets the `step` field of the model
+configuration to the checkpoint step (extracted from the file name).
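+
+As a minimal sketch (the layer and parameter names below are illustrative,
+not taken from the SINGA examples), explicitly naming the Param objects in
+the new model's job.conf ensures they match the names stored in the
+checkpoint:
+
+    layer {
+      name: "fc1"
+      # other layer settings omitted
+      param {
+        name: "w1"  # must equal the Param name saved in the checkpoint
+      }
+      param {
+        name: "b1"
+      }
+    }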
+
+
+### Caution
+
+Both applications must be handled carefully when Param objects are
+partitioned due to model partitioning. For example, if the original training
+used 2 workers while the new model (or the resumed training) uses 3 workers,
+the same original Param object is partitioned in different ways and hence
+the partitions cannot be matched.
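+
+For instance (a sketch reusing the cluster configuration field from the
+examples doc), keeping the worker-group size identical between the original
+run and the resumed or new run avoids this mismatch:
+
+    # cluster.conf used for the original training
+    nworkers_per_group: 2
+
+    # cluster.conf for resuming or for the new model: keep the same value
+    nworkers_per_group: 2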

Modified: incubator/singa/site/trunk/content/markdown/docs/data.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/data.md?rev=1692534&r1=1692533&r2=1692534&view=diff
==============================================================================
--- incubator/singa/site/trunk/content/markdown/docs/data.md (original)
+++ incubator/singa/site/trunk/content/markdown/docs/data.md Fri Jul 24 15:07:09 2015
@@ -0,0 +1,18 @@
+## Data Preparation
+
+To submit a training job, users need to convert raw data (e.g., images, text
+documents) into records that can be recognized by SINGA. SINGA uses a
+DataLayer to load these records into memory and a ParserLayer to parse
+features (e.g., image pixels and labels) from them. The records can be
+organized and stored in many different ways, e.g., in a lightweight database,
+a file, or HDFS, as long as there is a corresponding DataLayer that can load
+them.
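+
+As a minimal sketch (the layer types and the shard path are copied from the
+MLP example in these docs), a DataLayer and two ParserLayers could be
+configured together as:
+
+    layer {
+      name: "data"
+      type: kShardData           # DataLayer: loads records from a DataShard
+      sharddata_conf {
+        path: "examples/mnist/mnist_train_shard"
+        batchsize: 1000
+      }
+    }
+
+    layer {
+      name: "mnist"
+      type: kMnist               # ParserLayer: parses image pixels
+      srclayers: "data"
+      mnist_conf {
+        norm_a: 127.5            # pixel normalization settings
+        norm_b: 1
+      }
+    }
+
+    layer {
+      name: "label"
+      type: kLabel               # ParserLayer: parses labels
+      srclayers: "data"
+    }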
+
+### DataShard
+
+
+
+### LMDB
+
+
+
+### HDFS

Modified: incubator/singa/site/trunk/content/markdown/docs/examples.md
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/markdown/docs/examples.md?rev=1692534&r1=1692533&r2=1692534&view=diff
==============================================================================
--- incubator/singa/site/trunk/content/markdown/docs/examples.md (original)
+++ incubator/singa/site/trunk/content/markdown/docs/examples.md Fri Jul 24 15:07:09 2015
@@ -1,170 +1,28 @@
-Title:
-Notice:    Licensed to the Apache Software Foundation (ASF) under one
-           or more contributor license agreements.  See the NOTICE file
-           distributed with this work for additional information
-           regarding copyright ownership.  The ASF licenses this file
-           to you under the Apache License, Version 2.0 (the
-           "License"); you may not use this file except in compliance
-           with the License.  You may obtain a copy of the License at
-           .
-             http://www.apache.org/licenses/LICENSE-2.0
-           .
-           Unless required by applicable law or agreed to in writing,
-           software distributed under the License is distributed on an
-           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-           KIND, either express or implied.  See the License for the
-           specific language governing permissions and limitations
-           under the License.
-
-Here are the examples of SINGA, including MLP, CNN, RBM and RNN models. This tutorial will show you some basic information about how to configure SINGA.
-To run a SINGA job, you need to configure two files, model.conf to specify the deep learning model and cluster.conf to define the distributed training architecture.
-
-model.conf
-====
-model.conf is the file that configures the deep learning model you want to train.
-It should contain the neurualnet structure, training algorithm(backforward or contrastive divergence etc.),
-SGD update algorithm(e.g. Adagrad), number of training/test steps and training/test frequency,
-and display features and etc.
-SINGA will read model.conf as a Google protobuf class
-[ModelProto](https://github.com/apache/incubator-singa/blob/master/src/proto/model.proto).
-Here is a simple example simplified from our [MLP example](https://github.com/apache/incubator-singa/blob/master/examples/mnist/model.conf):
-
-    name: "simple-mlp"
-    train_steps: 1000
-    test_steps:10
-    test_frequency:60
-    display_frequency:30
-    alg: kBackPropagation
-    updater{
-      base_lr: 0.001
-      lr_change: kStep
-      type: kSGD
-      step_conf{
-        change_freq: 60
-        gamma: 0.997
-      }
-    }
-
-    neuralnet {
-    layer {
-      name: "data"
-      type: kShardData
-      sharddata_conf {
-        path: "examples/mnist/mnist_train_shard"
-        batchsize: 1000
-      }
-      exclude: kTest
-    }
-
-    layer {
-      name: "data"
-      type: kShardData
-      sharddata_conf {
-        path: "examples/mnist/mnist_test_shard"
-        batchsize: 1000
-      }
-      exclude: kTrain
-    }
-
-    layer{
-      name:"mnist"
-      type: kMnist
-      srclayers: "data"
-      mnist_conf {
-        norm_a: 127.5
-        norm_b: 1
-      }
-    }
-
-    layer{
-      name: "label"
-      type: kLabel
-      srclayers: "data"
-    }
-
-    layer{
-      name: "fc"
-      type: kInnerProduct
-      srclayers:"mnist"
-      innerproduct_conf{
-        num_output: 2500
-      }
-      param{
-        name: "weight"
-        init_method: kUniform
-        low:-0.05
-        high:0.05
-      }
-      param{
-        name: "bias"
-        init_method: kUniform
-        low: -0.05
-        high:0.05
-      }
-    }
-
-    layer{
-      name: "tanh"
-      type: kTanh
-      srclayers:"fc1"
-    }
-
-    layer{
-      name: "pre-softmax"
-      type: kInnerProduct
-      srclayers:"tanh1"
-      innerproduct_conf{
-        num_output: 2000
-      }
-      param{
-        name: "weight"
-        init_method: kUniform
-        low:-0.05
-        high:0.05
-      }
-      param{
-        name: "bias"
-        init_method: kUniform
-        low: -0.05
-        high:0.05
-      }
-    }
-
-    layer{
-      name: "loss"
-      type:kSoftmaxLoss
-      softmaxloss_conf{
-        topk:1
-      }
-      srclayers:"pre-softmax"
-      srclayers:"label"
-    }
-    }
-
-In this example, we define a neuralnet that contains one hidden layer. fc+tanh is the hidden layer(fc is for the inner product part, and tanh is for the non-linear activation function), and the final softmax layer is represented as pre-softmax+loss (inner product and softmax). For each layer, we define its name, input layer(s), basic configurations (e.g. number of nodes, parameter initialization settings).
-You can also get more details about[programming model](http://singa.incubator.apache.org/docs/programming-model.html) from our website.
-
-cluster.conf
-====
-cluster.conf is the file that configures the distributed architecture you want to use.
-SINGA will read cluster.conf as a Google protobuf class [ClusterProto](https://github.com/apache/incubator-singa/blob/master/src/proto/cluster.proto).
-By configuring cluster.conf, you can let SINGA run in single machine, Sandblaster, Downpour, Hogwild, AllReduce mode and etc.
-The details about architecture settings are described in [System Architecture](http://singa.incubator.apache.org/docs/architecture.html) in our website. Below is a basic single machine configuration:
-
-
-    nworker_groups: 1
-    nserver_groups: 1
-    nservers_per_group: 1
-    nworkers_per_group: 1
-    nservers_per_procs: 1
-    nworkers_per_procs: 1
-    workspace: "examples/mnist/"
-
-
-List of examples
-====
-* [MLP using MNIST](http://singa.incubator.apache.org/docs/mlp.html)
-  - A simple backforward model : multilayer perception.
-* [CNN using CIFAR10](http://singa.incubator.apache.org/docs/cnn.html)
-  - A convolutional nereual network example, using more types of layers.
- 
+## Example Models
+
+Different models are provided as examples to help users get familiar with SINGA.
+[Neural Network](neuralnet.html) gives details on the models that are
+supported by SINGA.
+
+
+### Feed-forward neural networks
+
+  * [MultiLayer Perceptron](mlp.html) trained on the MNIST dataset for
+  handwritten digit recognition.
+
+  * [Convolutional Neural Network](cnn.html) trained on MNIST and CIFAR10 for
+  image classification.
+
+  * [Deep Auto-Encoders](auto-encoder.html) trained on MNIST for dimensionality
+  reduction.
+
+
+### Recurrent neural networks (RNN)
+
+  * [RNN language model](rnn.html) trained on the Penn Treebank corpus for
+  language modeling.
+
+### Energy models
+
+  * [RBM](rbm.html) used to pre-train deep auto-encoders for dimensionality
+  reduction.
+

Modified: incubator/singa/site/trunk/content/site.xml
URL: http://svn.apache.org/viewvc/incubator/singa/site/trunk/content/site.xml?rev=1692534&r1=1692533&r2=1692534&view=diff
==============================================================================
--- incubator/singa/site/trunk/content/site.xml (original)
+++ incubator/singa/site/trunk/content/site.xml Fri Jul 24 15:07:09 2015
@@ -60,19 +60,11 @@
         <item name="Neural Network" href="docs/neuralnet.html"/>
         <item name="Layer" href="docs/layer.html"/>
       </item>
-      <item name = "Data Preparation" href = "docs/data.html">
-        <item name = "DataShard" href = "docs/datashard.html"/>
-        <item name = "LMDB" href = "docs/lmdb.html"/>
-        <item name = "HDFS" href = "docs/hdfs.html"/>
-      </item>
+      <item name = "Data Preparation" href = "docs/data.html"/>
+      <item name = "Checkpoint" href = "docs/checkpoint.html"/>
       <item name="System Architecture" href="docs/architecture.html"/>
       <item name="Communication" href="docs/communication.html"/>
-      <item name="Examples" href="docs/examples.html">
-        <item name="MLP" href="docs/mlp.html"/>
-        <item name="CNN" href="docs/cnn.html"/>
-        <item name = "RBM" href="docs/rbm.html"/>
-        <item name = "RNN" href="docs/rnn.html"/>
-      </item>
+      <item name="Examples" href="docs/examples.html"/>
     </menu>
 
     <menu name="Development">


