singa-dev mailing list archives

From GitBox <>
Subject [GitHub] [singa] XJDKC opened a new pull request #626: [WIP] SINGA-505 Computational graph with memory optimization
Date Wed, 11 Mar 2020 11:47:50 GMT
   # Overview
   This PR adds a computational graph with memory optimization. It is based on code
developed by @chrishkchris and @XJDKC, and on discussions with @nudles.
   # Features
   There are three main features in this PR, namely the construction of the computational
graph, lazy allocation and automatic recycling. Details as follows:
   * `Computational graph construction`: Construct a computational graph based on the user-defined
neural network or expressions and then run the graph to accomplish the training task.
   * `Lazy allocation`: When a block is created, the device does not allocate memory for it
immediately; memory is allocated only when an operation accesses the block for the first time.
   * `Automatic recycling`: While the graph is running in an iteration, automatically deallocate
intermediate tensors that will not be used by any later operation.
   # Design
   1. Computational graph construction
       * Use delayed execution: trace one forward and one backward pass without actually
executing the operations, buffering every operation together with the tensors it reads or writes.
       * Calculate the dependencies among all the buffered operations to decide the order of
execution (directed acyclic graphs are supported).
       * Execute all the operations in the order we just calculated to update all the parameters.
   2. Lazy allocation
       * When a device needs to create a new block, just pass the device to that block instead
of allocating a piece of memory from the mempool and passing the pointer to that block.
       * When the block is accessed for the first time, let the device corresponding to the
block allocate memory and then access it.
   3. Automatic recycling
       * When calculating dependencies between the operations during the graph construction,
the reference count of tensors can also be calculated.
       * When an operation is completed, we can decrease the reference count of tensors the
operation used.
       * If a tensor's reference count reaches zero, the tensor won't be accessed by any
later operation and its memory can be recycled.
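The dependency calculation and ordered execution in step 1 can be sketched in plain Python (an illustrative toy, not SINGA's actual C++ scheduler; `Op`, `run_graph`, and the block-id strings are made-up names, and only read-after-write dependencies are tracked):

```python
from collections import deque

class Op:
    """A buffered operation: a callable plus the block ids it reads and writes."""
    def __init__(self, fn, reads, writes):
        self.fn, self.reads, self.writes = fn, reads, writes

def run_graph(ops):
    """Execute buffered ops in dependency order (Kahn's topological sort).

    An op depends on every earlier op that wrote a block it reads."""
    deps = [0] * len(ops)            # unmet dependencies per op
    out_edges = [[] for _ in ops]    # i -> ops that must wait for op i
    last_writer = {}                 # block id -> op that last wrote it
    for i, op in enumerate(ops):
        for blk in op.reads:
            if blk in last_writer:
                out_edges[last_writer[blk]].append(i)
                deps[i] += 1
        for blk in op.writes:
            last_writer[blk] = i

    ready = deque(i for i, d in enumerate(deps) if d == 0)
    order = []
    while ready:
        i = ready.popleft()
        ops[i].fn()                  # run the buffered operation
        order.append(i)
        for j in out_edges[i]:
            deps[j] -= 1
            if deps[j] == 0:
                ready.append(j)
    return order
```

With this scheme each operation runs exactly once, and only after every operation that produced one of its inputs has finished.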
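The lazy allocation in step 2 can be sketched as follows (a simplified Python stand-in for the C++ `Block` and `Device` classes; the method names and the byte-array backing are illustrative, not SINGA's real API):

```python
class Device:
    """Toy device that tracks how many bytes it has handed out."""
    def __init__(self):
        self.allocated = 0

    def malloc(self, size):
        self.allocated += size
        return bytearray(size)  # stand-in for a raw device pointer

class Block:
    """Holds only a device reference and a size at creation time;
    real memory is allocated on the first access."""
    def __init__(self, device, size):
        self.device = device
        self.size = size
        self.data = None  # no memory yet: allocation is deferred

    def mutable_data(self):
        if self.data is None:  # first access triggers the allocation
            self.data = self.device.malloc(self.size)
        return self.data
```

Creating a `Block` therefore costs nothing on the device; memory is consumed only when some operation actually touches the block.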
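The reference counting in step 3 can be sketched like this (plain Python, with operations encoded as `(fn, reads)` pairs already in execution order; the sketch deliberately ignores parameters and outputs, which must of course outlive the iteration):

```python
def run_with_recycling(ops, free):
    """Run ops in order, freeing each block as soon as its last reader finishes.

    ops:  list of (fn, reads) pairs, where reads lists the block ids the op uses.
    free: callback that recycles a block's memory."""
    # Reference counts can be computed while the graph is being built:
    # each read of a block bumps its count by one.
    counts = {}
    for _, reads in ops:
        for blk in reads:
            counts[blk] = counts.get(blk, 0) + 1

    for fn, reads in ops:
        fn()
        for blk in reads:          # this op is done with its inputs
            counts[blk] -= 1
            if counts[blk] == 0:   # no later op reads this block
                free(blk)
```

Because the counts are known before execution starts, no liveness analysis is needed at run time; a decrement per read is enough.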
   # Changes
   * `Tensor` & `Operation`
       * Change the capture type of tensors in lambda expressions to enable delayed execution.
       * Change the types of the input and output parameters to ensure that each operation's
inputs and outputs are tensors.
   * `Device`: Add code for
       * buffering operations,
       * constructing the graph,
       * calculating dependencies,
       * executing the graph.
   * `Block`: Add a device member variable to support lazy allocation, and a function to
support automatic recycling.
   * `Swig`: Add the corresponding interfaces.
   * `Examples`: Add examples that use operation buffering.
   # Evaluation
   * Experiment settings
       * Model: ResNet50 (in `examples/autograd/` on the dev branch)
       * GPU: Nvidia RTX 2080Ti
   * Result: `s = second` `b = batch`

     | Batchsize | Cases | Memory-Usage (peak) | Throughput | Time | Reduction Rate | Speedup |
     | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
     | 16 | dev branch | | | | | |
     | 16 | PR (no graph) | | | | | |
     | 16 | PR (with graph) | | | | | |
     | 32 | dev branch | | | | | |
     | 32 | PR (no graph) | | | | | |
     | 32 | PR (with graph) | | | | | |
   From the table above, we can see that:
   * This PR does not affect training time or memory usage when the graph is disabled, i.e.,
it is backward compatible.
   * With the graph enabled, this PR significantly reduces memory usage and training time.
   # How to use
```python
# Initialize the input tensors
# ...

# Buffer the operations
x = autograd.matmul(inputs, w0)
x = autograd.add_bias(x, b0)
x = autograd.relu(x)
x = autograd.matmul(x, w1)
x = autograd.add_bias(x, b1)
# x = autograd.softmax(x)
loss = autograd.softmax_cross_entropy(x, target)
for p, gp in autograd.backward(loss):
    sgd.apply(0, gp, p, "")

# Run the graph
print("start executing buffered functions")
for i in range(1001):
    ...
```
   # Plan
   - [ ] Computational graph optimization: replace a subgraph of the input computational graph
with a functionally equivalent subgraph.
   - [ ] Operation parallelization: execute independent operations in parallel.
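As a rough illustration of what the planned subgraph replacement could look like, here is a toy Python sketch that fuses a fixed consecutive pattern of operations into one equivalent fused op (the op names, the `(name, inputs)` encoding, and matching by name over a linear op list are all hypothetical simplifications; a real pass would match on the dataflow graph itself):

```python
def fuse(ops, pattern, replacement):
    """Replace each consecutive run of ops whose names match `pattern`
    (a tuple of op names) with a single op named `replacement`.

    ops: list of (name, inputs) pairs in execution order."""
    out, i = [], 0
    while i < len(ops):
        names = tuple(op[0] for op in ops[i:i + len(pattern)])
        if names == pattern:
            # The fused op inherits the inputs of the first op in the run.
            out.append((replacement, ops[i][1]))
            i += len(pattern)
        else:
            out.append(ops[i])
            i += 1
    return out
```

Fusing adjacent ops this way shrinks the graph and removes the intermediate tensors between the fused operations.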
