singa-dev mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: Question about model parallelism and multi-GPUs
Date Wed, 11 Nov 2015 02:21:27 GMT
That communication overhead will occur at every training instance.
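
For a rough sense of the trade-off being discussed, here is a back-of-the-envelope
sketch in plain Python, with made-up layer and batch sizes (not measurements from
SINGA or TensorFlow):

batch_size = 256        # images per iteration (hypothetical)
hidden_dim = 4096       # width of one fully connected layer (hypothetical)
num_gpus = 2
bytes_per_float = 4

# Data parallelism: each GPU holds a full replica of the layer, so its
# parameters/gradients (hidden_dim x hidden_dim floats) are exchanged
# every iteration to keep the replicas in sync.
data_parallel_bytes = hidden_dim * hidden_dim * bytes_per_float

# Model parallelism: the layer is split across GPUs, so parameters stay
# local, but the mini-batch's activations must cross GPUs every iteration.
model_parallel_bytes = batch_size * hidden_dim * bytes_per_float * (num_gpus - 1)

print("data  parallel: %.1f MiB / iteration" % (data_parallel_bytes / 2.0 ** 20))
print("model parallel: %.1f MiB / iteration" % (model_parallel_bytes / 2.0 ** 20))
# With these hypothetical numbers: 64.0 MiB vs 4.0 MiB per iteration.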

On Wed, Nov 11, 2015 at 11:11 AM, Wang Wei <wangwei@comp.nus.edu.sg> wrote:
> There will be communication overhead.
> However, compared with data parallelism, it saves some of the communication
> cost of transferring parameters, since the model is not replicated.
>
> On Wed, Nov 11, 2015 at 9:51 AM, Edward J. Yoon <edwardyoon@apache.org>
> wrote:
>
>> Partitioning by layer will cause messages to be sent across GPUs. Is a
>> huge performance loss expected here?
>>
>> On Wed, Nov 11, 2015 at 10:37 AM, Wang Wei <wangwei@comp.nus.edu.sg>
>> wrote:
>> > Hi Edward,
>> >
>> > Thanks for starting the discussion.
>> >
>> > On Wed, Nov 11, 2015 at 8:44 AM, Edward J. Yoon <edwardyoon@apache.org>
>> > wrote:
>> >
>> >> NYCTMI, I have my doubts. The RNN model parallelism described in the
>> >> TensorFlow whitepaper somewhat makes sense to me, but did Google
>> >> actually use model parallelism for CNNs on a multi-node (multi-device)
>> >> cluster? Blocking mat-mult on GPUs looks slow and memory demanding
>> >> to me. I mean, it's possible, but performance will suffer.
>> >>
>> > No, I didn't find details on how they implement model parallelism for
>> > fully connected layers.
>> > I think in TensorFlow, model parallelism is left to the user: users specify
>> > the sessions and the sub-graph of each session.
>> >
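>> > As a rough illustration (not code from the whitepaper; hypothetical layer
>> > sizes, and using tf.device placement, the usual way to pin a sub-graph to
>> > a device):
>> >
>> > import numpy as np
>> > import tensorflow as tf  # TF 1.x-style API assumed
>> >
>> > x = tf.placeholder(tf.float32, shape=[None, 4096])
>> >
>> > # Sub-graph pinned to the first GPU.
>> > with tf.device('/gpu:0'):
>> >     w0 = tf.Variable(tf.random_normal([4096, 2048]))
>> >     h0 = tf.nn.relu(tf.matmul(x, w0))
>> >
>> > # Sub-graph pinned to the second GPU; h0 is copied across devices here.
>> > with tf.device('/gpu:1'):
>> >     w1 = tf.Variable(tf.random_normal([2048, 10]))
>> >     logits = tf.matmul(h0, w1)
>> >
>> > # allow_soft_placement lets the sketch also run on a machine without GPUs.
>> > with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
>> >     sess.run(tf.global_variables_initializer())
>> >     out = sess.run(logits, feed_dict={x: np.random.rand(8, 4096).astype('float32')})
>> >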
>> >>
>> >> So, I wanted to check how you approach this problem.
>> >>
>> > We are doing it by partitioning the layer into sub-layers. The online ppt
>> > has more details.
>> > This part is not fully complete; we will release it in the second
>> > version. You can check the code then.
>> >
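>> > Conceptually (plain numpy here, not SINGA code), splitting one fully
>> > connected layer into two sub-layers can be pictured like this: each
>> > sub-layer owns half of the output neurons, and the layer's output is
>> > their concatenation, gathered across devices every iteration:
>> >
>> > import numpy as np
>> >
>> > batch, in_dim, out_dim = 8, 1024, 1024   # hypothetical sizes
>> > x = np.random.rand(batch, in_dim)        # input mini-batch
>> > W = np.random.rand(in_dim, out_dim)      # full weight matrix
>> >
>> > y_full = x.dot(W)                        # the unpartitioned layer
>> >
>> > # Each sub-layer owns half of the output columns of W.
>> > W0, W1 = np.split(W, 2, axis=1)
>> > y0 = x.dot(W0)                           # would run on GPU 0
>> > y1 = x.dot(W1)                           # would run on GPU 1
>> > y_part = np.concatenate([y0, y1], axis=1)
>> >
>> > assert np.allclose(y_full, y_part)
>> >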
>> >>
>> >>
>> >> On Wed, Nov 11, 2015 at 9:33 AM, ooibc <ooibc@comp.nus.edu.sg> wrote:
>> >> >
>> >> >
>> >> > fyi, in case you are not in the dev@ list.
>> >> >
>> >> > There is a typo below: ancestor-->descendant (MXNet was introduced
>> >> > on 9/28).
>> >> >
>> >> >
>> >> > -------- Original Message --------
>> >> > Subject: development plan for SINGA
>> >> > Date: 2015-11-10 22:28
>> >> > From: ooibc <ooibc@comp.nus.edu.sg>
>> >> > To: dev@singa.incubator.apache.org
>> >> > Reply-To: dev@singa.incubator.apache.org
>> >> >
>> >> > Based on our quick check of the TensorFlow release and online
>> >> > discussions, it appears to be 2x slower than MXNet (ancestor of CXXNET)
>> >> > on the cifar10 dataset, and it contains older code like cudnn-v2.
>> >> > This could be just a form of crowdsourcing at work.
>> >> >
>> >> > SINGA is data-flow centric in design, and provides simple interfaces,
>> >> > from layer abstraction to neural net structure, model
>> >> > configuration/mapping, model/data partitioning, function overriding,
>> >> > and training framework configuration.
>> >> >
>> >> > So, we are good and we should keep to the development/release plan
>> >> > outlined in
>> >> >     http://singa.apache.org/develop/schedule.html
>> >> >
>> >> > Thanks, and Happy Deepavali (to those who celebrate)!
>> >> >
>> >> > regards
>> >> > beng chin
>> >>
>> >>
>> >>
>> >> --
>> >> Best Regards, Edward J. Yoon
>> >>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>>



-- 
Best Regards, Edward J. Yoon
