systemml-dev mailing list archives

From Sourav Mazumder <sourav.mazumde...@gmail.com>
Subject Re: Using RDMA for Deep Learning
Date Fri, 26 Feb 2016 15:50:49 GMT
Hi Niketan,

Thanks for the detailed information.

Here are a few follow-on comments/questions:

1. Maybe instead of re-creating the MPI-based communication layer needed for
distributed deep network training, SystemML could support CaffeOnSpark itself
as another runtime platform, at least to start with; later on, SystemML could
provide its own, more sophisticated implementation.
Also, sooner or later the CaffeOnSpark components may become part of the Spark
distribution (core or an extension).

2. Can someone today still write DML for a neural network using the existing
facilities of SystemML? If so, what would be the drawbacks, given that (as I
understand it) the current implementation lacks the advanced features
described in epics 540 and 445?

Regards,
Sourav

On Thu, Feb 25, 2016 at 10:24 AM, Niketan Pansare <npansar@us.ibm.com>
wrote:

> Hi Sourav,
>
> RDMA-based model synchronization can be considered in our future roadmap.
> Here are the two epics that this task depends on:
> https://issues.apache.org/jira/browse/SYSTEMML-540
> https://issues.apache.org/jira/browse/SYSTEMML-445
>
> It is also important to note that "model parallelism" gets us into an
> accuracy vs. performance tradeoff discussion. The SystemML
> optimizer/rewrites, in their current state, make decisions that improve
> performance but do not change the semantics of the input DML script. So, an
> imprecise language-level primitive (for example, "minimize(layer)") might be
> required to support "model parallelism".
>
> Another point: my guess is that the Yahoo ML team added a parallel MPI-based
> communication layer (for example, using MVAPICH2 to enable CUDA 5.0's
> GPUDirect RDMA) rather than relying on Spark's communication layer. I
> understand this is necessary for distributed learning for deep networks (to
> avoid paying the cost of communicating to/from the JVM). However, it is a
> non-trivial addition to any system and needs detailed discussion (for
> example: fault tolerance, permissions, homogeneous clusters, etc.) :)
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
>
> From: Sourav Mazumder <sourav.mazumder00@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 02/25/2016 07:33 AM
> Subject: Using RDMA for Deep Learning
> ------------------------------
>
>
>
> Hi,
>
> I was checking out CaffeOnSpark for deep learning, which Yahoo has just open
> sourced:
>
> http://yahoohadoop.tumblr.com/post/139916563586/caffeonspark-open-sourced-for-distributed-deep?soc_src=mail&soc_trk=ma
>
> I am wondering whether SystemML could also leverage an RDMA-based model
> synchronization approach, as CaffeOnSpark does.
>
> If not, could that be considered in the future roadmap?
>
> Regards,
> Sourav
>