singa-dev mailing list archives

From Wang Wei <wang...@comp.nus.edu.sg>
Subject Re: Communication between GPUs
Date Wed, 22 Apr 2015 02:33:02 GMT
Thanks a lot!



On Tue, Apr 21, 2015 at 5:51 PM, 陈海波 <hzchenhaibo@corp.netease.com> wrote:

> Hi~,
>     In our previous work on deep learning with GPUs, we focused on the
> parallel training of DNNs (without convolution layers) for speech
> recognition. It isn't easy to adopt a model-parallelization strategy to
> speed up training, and considering the cost of transferring a big model
> from node to node, we decided to use a single node with multiple GPUs
> for training. We use CUDA APIs for transferring messages between GPUs
> (both with and without GPUDirect support). In our plan, the problem of
> multi-node communication does not arise.
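
For concreteness, the direct GPU-to-GPU path described above looks roughly
like this with the CUDA runtime (a minimal sketch, assuming two visible
devices; error handling omitted):

    #include <cstdlib>
    #include <cuda_runtime.h>

    int main() {
      const size_t nbytes = 1 << 20;                // 1 MB payload
      float *src, *dst;
      cudaSetDevice(0); cudaMalloc(&src, nbytes);   // buffer on GPU 0
      cudaSetDevice(1); cudaMalloc(&dst, nbytes);   // buffer on GPU 1

      int can_p2p = 0;
      cudaDeviceCanAccessPeer(&can_p2p, 1, 0);      // can GPU 1 reach GPU 0?
      if (can_p2p) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);           // GPUDirect P2P path
        cudaMemcpyPeer(dst, 1, src, 0, nbytes);     // direct GPU-to-GPU copy
      } else {
        float *host = (float*)malloc(nbytes);       // fall back: stage via host
        cudaSetDevice(0);
        cudaMemcpy(host, src, nbytes, cudaMemcpyDeviceToHost);
        cudaSetDevice(1);
        cudaMemcpy(dst, host, nbytes, cudaMemcpyHostToDevice);
        free(host);
      }
      return 0;
    }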
> Some discussions:
> 1) We think cudamat is a good choice for linear algebra computation, but
> we found that you use the mshadow library to develop SINGA.
>    As we know, mshadow provides a GPU matrix/tensor template library, and
> it also supports some simple interfaces for multi-GPU. So we think we
> can continue using mshadow for linear algebra computation on both GPU
> and CPU.
>
Yes. We will continue using Mshadow.
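
For reference, a matrix product with mshadow looks roughly like the
following (a minimal sketch; the engine setup and FreeSpace signatures
differ slightly across mshadow versions, and retargeting to CPU is just a
matter of swapping gpu for cpu in the templates):

    #include "mshadow/tensor.h"
    using namespace mshadow;
    using namespace mshadow::expr;

    int main() {
      InitTensorEngine<gpu>();                      // set up CUDA/cuBLAS
      Tensor<gpu, 2, float> lhs = NewTensor<gpu>(Shape2(2, 3), 1.0f);
      Tensor<gpu, 2, float> rhs = NewTensor<gpu>(Shape2(3, 4), 2.0f);
      Tensor<gpu, 2, float> dst = NewTensor<gpu>(Shape2(2, 4), 0.0f);
      dst = dot(lhs, rhs);        // lazy expression, evaluated on assignment
      FreeSpace(&lhs); FreeSpace(&rhs); FreeSpace(&dst);
      ShutdownTensorEngine<gpu>();
      return 0;
    }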

> 2) We consulted NVIDIA's officials, and their answer was that they are
> not sure whether ZeroMQ supports GPUDirect and Infiniband;
>    they suggested that we adopt OpenMPI.
>
ZeroMQ should support Infiniband (http://zeromq.org/area:results), but it
may not support GPUDirect. It seems Caffe (
https://github.com/BVLC/caffe/blob/parallel/src/caffe/parallel.cpp) is
implementing distributed training using GPU+Infiniband, but GPUDirect is
not used. I will learn more about GPUDirect and discuss it with you.
Another solution that I am exploring is to provide a general messaging
API (like https://github.com/dmlc/rabit) with different implementations
(ZeroMQ or MPI), as in the sketch below.
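
Roughly, the interface I have in mind would look like the following (a
hypothetical sketch; the names Channel, ZmqChannel and MpiChannel are
illustrative, not a fixed API):

    #include <cstddef>

    // One abstract channel that workers and servers program against.
    class Channel {
     public:
      virtual ~Channel() {}
      virtual void Send(int dst, const void* buf, size_t len) = 0;
      virtual size_t Recv(int src, void* buf, size_t maxlen) = 0;
    };

    // One subclass per transport, chosen at configuration time:
    //   class ZmqChannel : public Channel { ... };  // ZeroMQ over TCP/Infiniband
    //   class MpiChannel : public Channel { ... };  // MPI, possibly GPUDirect-aware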

>    And I think we can discuss this further.
>
> thanks~
>
> On 2015-04-21 12:05:19, 陈海波 <hzchenhaibo@corp.netease.com> wrote:
> > As planned in the previous discussion, we are stabilizing the APIs of
> > each module.
> > One problem I have encountered concerns the communication APIs needed
> > to support GPUs.
> >
> > We can use libraries like cudamat (https://code.google.com/p/cudamat/)
> > for linear algebra computation. Hence, the APIs for computation would
> > be almost the same as those for CPU. But I have little knowledge of
> > the communication between GPU and CPU, or of the communication between
> > GPUs, so I am asking for your suggestions.
> >
> > Wangyuan, Wuwei and Haibo: since you are working on deep learning
> > using GPUs, it would be appreciated if you could give some feedback.
> >
> > As far as I know, traditionally messages are transferred from GPU
> > memory to CPU memory, then through TCP/IP to other nodes, and then
> > from CPU memory to GPU memory. We can easily support such
> > communication using the current APIs for CPU, but the transfers
> > between GPU and CPU bring extra cost.
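
(The staged path described above amounts to something like the following;
a hypothetical sketch combining the CUDA runtime with ZeroMQ, with
illustrative function names and no error handling:)

    #include <cstdlib>
    #include <cuda_runtime.h>
    #include <zmq.h>

    // GPU -> host -> network: the sender side of the staged path.
    void SendParam(void* zmq_sock, const float* gpu_buf, size_t nbytes) {
      float* host = (float*)malloc(nbytes);
      cudaMemcpy(host, gpu_buf, nbytes, cudaMemcpyDeviceToHost);
      zmq_send(zmq_sock, host, nbytes, 0);
      free(host);
    }

    // network -> host -> GPU: the matching receiver side.
    void RecvParam(void* zmq_sock, float* gpu_buf, size_t nbytes) {
      float* host = (float*)malloc(nbytes);
      zmq_recv(zmq_sock, host, nbytes, 0);
      cudaMemcpy(gpu_buf, host, nbytes, cudaMemcpyHostToDevice);
      free(host);
    }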
> > NVIDIA has provided a technique called GPUDirect, which enables direct
> > message passing from GPU memory to the network (e.g., Infiniband)
> > card. Some MPI variants now use this technique. But since we have
> > switched from MPI to ZeroMQ, we need to make sure that ZeroMQ supports
> > GPUDirect and Infiniband. Have you done any investigation into this?
> > Or how do you implement the message transfer in your implementation?
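
(For comparison, "some MPI variants use this technique" refers to
CUDA-aware MPI builds, where a device pointer can be handed directly to
MPI; a hypothetical fragment, and whether GPUDirect is actually used
underneath depends on the build and hardware:)

    #include <mpi.h>

    // With a CUDA-aware MPI, gpu_buf may point to device memory; the
    // library handles the transfer, possibly via GPUDirect RDMA.
    void SendParamMPI(float* gpu_buf, int count, int peer) {
      MPI_Send(gpu_buf, count, MPI_FLOAT, peer, /*tag=*/0, MPI_COMM_WORLD);
    }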
> >
> > Thanks.
> >
> > regards,
> > Wei
>
>
