singa-dev mailing list archives

From Wang Wei <wang...@comp.nus.edu.sg>
Subject Re: Caffe con Troll and SINGA?
Date Sat, 25 Jul 2015 03:40:26 GMT
Hi Ce,

Thanks for your explanation.
I read the example training configuration of CcT on GitHub and have some
further questions about the parallelization.

How do you parallelize the training across CPU and GPU: by creating two
threads, each with its own (Caffe) Solver? Or does each Layer partition
the mini-batch and dispatch the partitions onto two threads (one using
the GPU Driver and one using the CPU Driver) after receiving the
Pointer?

According to the configuration file (
https://github.com/nudles/CaffeConTroll/blob/master/tests/imagenet_train/train_val/alexnet_train_val_1GPU_CPU.prototxt),
it seems the parallelization is at the Layer level (i.e., the second
option). Specifically, consider two connected layers A->B (e.g., conv
and relu). If the batch partitioning for A is 0.6 on GPU and 0.4 on CPU,
while the partitioning for B is 1.0 (i.e., all) on GPU, then the
computation for B must be blocked until the computation for A finishes.
Consequently, synchronization happens at every layer.
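
To make sure I read this correctly, here is a minimal C++ sketch of the
per-layer barrier as I picture it; RunLayer, Batch, and the callable
argument are my own illustrative names, not CcT's actual API:

#include <future>

// Hypothetical sketch: split one layer's mini-batch by the configured
// ratio, run both partitions concurrently, and block until both finish
// so the next layer sees a complete output.
struct Batch { int begin, end; };  // index range within the mini-batch

template <typename LayerFn>
void RunLayer(LayerFn forward, int batch_size, float gpu_ratio) {
  int split = static_cast<int>(batch_size * gpu_ratio);
  // e.g. gpu_ratio = 0.6 on a 256 batch: examples [0, 153) go to the
  // GPU driver, [153, 256) to the CPU driver, each in its own thread.
  auto gpu = std::async(std::launch::async, forward, Batch{0, split});
  auto cpu = std::async(std::launch::async, forward,
                        Batch{split, batch_size});
  gpu.get();  // barrier: layer B (1.0 on GPU) cannot start until both
  cpu.get();  // partitions of layer A (conv) have finished
}

If a worker invokes something like RunLayer once per layer, the
synchronization cost is paid at every layer boundary, which matches the
behavior described above.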

For SINGA, we use the worker-server architecture (similar to parameter
server).
Currently, we support both synchronous and asynchronous training on CPU.
We refer to the training framework implemented by CcT as synchronous
training.
SINGA implements this training framework by partitioning the neural
network among one worker group; each worker runs in its own thread.
The partitioning is currently done at the Layer level. Users can
configure it to be on dimension 0 or 1 of the feature blob.
Dimension 0 partitions one mini-batch across the workers, like CcT
(but with equal partitioning).
Dimension 1 partitions one layer (e.g., one with 4096 neurons) into
sub-layers (each with 1024 neurons if there are 4 workers).
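
As an illustration of the two schemes (Shape and the function names are
mine, not SINGA's actual API):

#include <vector>

// A feature blob of shape (batch, features), e.g. (256, 4096).
struct Shape { int batch, features; };

// Dimension 0: each of n workers gets an equal slice of the mini-batch,
// e.g. (256, 4096) with 4 workers -> 4 blobs of shape (64, 4096).
std::vector<Shape> PartitionDim0(Shape s, int n) {
  return std::vector<Shape>(n, Shape{s.batch / n, s.features});
}

// Dimension 1: each worker gets a sub-layer with a slice of the neurons,
// e.g. (256, 4096) with 4 workers -> 4 sub-layers of shape (256, 1024).
std::vector<Shape> PartitionDim1(Shape s, int n) {
  return std::vector<Shape>(n, Shape{s.batch, s.features / n});
}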
After partitioning, each (sub) layer will be assigned a location ID, i.e.,
the ID of the worker to which the (sub) layer will be dispatched.
We will also support partitioning at the neural network level, i.e.,
letting users configure the location ID of each layer directly.
During training, each worker has the full neural network structure, but
it only visits (e.g., in the forward pass) the layers that are
dispatched to it (based on the location ID and its worker ID).
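
In code, the dispatch rule might look like the sketch below (Layer here
is an illustrative stand-in, not SINGA's actual Layer class):

#include <vector>

// Every worker holds the full net, but it only executes the (sub)layers
// whose location ID matches its own worker ID.
struct Layer {
  int location_id;          // ID of the worker this (sub)layer went to
  void Forward() { /* compute this (sub)layer's features */ }
};

void ForwardPass(std::vector<Layer>& net, int worker_id) {
  for (Layer& layer : net)
    if (layer.location_id == worker_id)
      layer.Forward();      // layers owned by other workers are skipped
}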


To support CcT in SINGA:
1. We first need to support GPU training (should be done in August).
2. Update the neural network partitioning function to integrate CcT's
scheduling strategy.
3. Make the Worker class a template, and create GPU workers and CPU
workers after partitioning the neural network (a sketch follows this
list).
4. I think the Lowering and Shifting techniques are easy to integrate
as a library if they are independent of the devices.
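
For step 3, a rough sketch of the templated Worker, assuming
hypothetical CPUDevice/GPUDevice policy types (SINGA's real classes may
differ):

// Device policies wrap the math backends behind one static interface.
struct CPUDevice {
  static void Gemm(/* args elided */) { /* e.g. BLAS / Mshadow on CPU */ }
};
struct GPUDevice {
  static void Gemm(/* args elided */) { /* e.g. cuBLAS on GPU */ }
};

template <typename Device>
class Worker {
 public:
  void RunOneBatch() {
    // forward/backward over the layers dispatched to this worker; all
    // math goes through Device, mirroring CcT's Driver abstraction
    Device::Gemm();
  }
};

// After partitioning the net, instantiate one worker per partition,
// e.g. Worker<GPUDevice> for the GPU slice, Worker<CPUDevice> for CPU.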

Am I missing any other features of CcT?

Regards,
Wei

> A Layer takes as input a Pointer, and calls the `dereference` function
> of Pointer to obtain a local copy (w.r.t. the driver) of the data.
> Therefore, the Layer object does not know where the input data comes
> from.
>
> To run operations like GEMM and Lowering, a Layer will call the
> Driver, which provides a unified interface across devices.

The Layer, Pointer and Driver abstractions are clear and easy to
understand.

> I think it is possible for you to compile each layer as a library that
> you can call, which takes as input a pointer object and fills in
> another pointer object as output.
>
> >> How do you synchronize the parameters trained on CPU and GPU, using
> >> the implementation of Hogwild from Caffe?
>
> Currently, we parallelize inside a batch -- if the batch size is 256,
> we might put 200 of the examples on the GPU and 56 of them on the CPU.
> After their results come back, we aggregate them. This means that our
> current result is exactly the same as a single-threaded run.
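
If I understand correctly, this holds because the mini-batch gradient is
a sum over its examples, so the partial gradients from the 200 GPU
examples and the 56 CPU examples can simply be added. A minimal sketch,
with my own names:

#include <vector>

// Sum partial gradients computed on disjoint slices of one mini-batch;
// adding the 200-example and 56-example partials reproduces the
// full-batch (256-example) gradient exactly.
std::vector<float> Aggregate(const std::vector<float>& grad_gpu,
                             const std::vector<float>& grad_cpu) {
  std::vector<float> grad(grad_gpu.size());
  for (std::size_t i = 0; i < grad.size(); ++i)
    grad[i] = grad_gpu[i] + grad_cpu[i];
  return grad;
}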
>
> For AlexNet with a batch size of 256 (the one most papers use), we
> observe that this strategy gives almost linear speedup even with 4
> Amazon EC2 GPUs.
>
> Of course Hogwild! or parameter servers are a natural direction to
> further scale up the current system when the number of computational
> devices further increases and the aggregation time starts to
> dominate...
>
> >> Which parts of Caffe have you changed (we also borrow some code
> >> from Caffe, so we know its structure)?
>
> We borrowed the parser code, loader code, and protobuf code. The main
> reason is to make sure CcT is compatible with Caffe. I think we
> rewrote most of the other layers, especially CONV. For faster layers
> like ReLU, our code is very similar to Caffe's.
>
> Let us know if you have any questions!
>
> Ce
>
> On Wed, Jul 22, 2015 at 10:49 PM, Wang Wei <wangwei@comp.nus.edu.sg>
> wrote:
>
>>
>>
>> On Thu, Jul 23, 2015 at 11:47 AM, Wang Wei <wangwei@comp.nus.edu.sg>
>> wrote:
>>
>>> Hi Ce,
>>>
>>> Thanks for starting the discussion.
>>>
>>> We are preparing documentation and tests for our first Apache
>>> Incubator release.
>>> I planned to contact Stefan after the first release, because the
>>> first release does not support GPU.
>>> There are some developers at NetEase working on the GPU
>>> implementation, which will be integrated in the second release.
>>>
>>> It is a good time to discuss the integration now. We can consider this
>>> new feature while implementing the GPU version.
>>> Currently, we use Blob (from Caffe) to manage memory and Mshadow
>>> for computation.
>>> To integrate Caffe con Troll, the ideal case is making Caffe con
>>> Troll a library like Mshadow.
>>> I think at least the convolution optimization techniques (lowering,
>>> multiply, lifting) could be compiled as a library (correct?).
>>>
>>> How do you manage the memory across CPU and GPU?
>>> How do you synchronize the parameters trained on CPU and GPU, using the
>>> implementation of Hogwild from Caffe?
>>> Which parts of Caffe have you changed (we also borrow some code
>>> from Caffe, so we know its structure)?
>>>
>>> Thank you.
>>> (I cc'ed our dev-mailing list to notify the developers on GPU)
>>>
>>> Regards,
>>> Wei Wang
>>>
>>>
>>>
>>> On Thu, Jul 23, 2015 at 9:47 AM, Ce Zhang <czhang@cs.wisc.edu> wrote:
>>>
>>>> Hi Wei,
>>>>
>>>> I am Ce from Wisconsin, in Chris Re's group.
>>>> I am one of the developers of Caffe con Troll
>>>> (the CNN system that is faster than Caffe on
>>>> CPU and can run hybrid across CPU and GPU).
>>>>
>>>> I think you and Stefan (CC'ed) chatted at SIGMOD
>>>> about the possibility of integrating Caffe con Troll into
>>>> Apache SINGA. We are very excited about this!
>>>>
>>>> We are very curious about how to make this happen,
>>>> e.g., what information do you need from us to do
>>>> such an integration. This email aims at starting this
>>>> discussion, and we'd love to hear your opinions.
>>>>
>>>> Thanks!
>>>>
>>>> Ce
>>>>
>>>
>>>
>>
>
