systemml-dev mailing list archives

From "Niketan Pansare" <>
Subject Re: Using RDMA for Deep Learning
Date Thu, 25 Feb 2016 18:24:26 GMT

Hi Sourav,

RDMA-based model synchronization could be considered for our future roadmap.
Here are two epics that this task depends on:

It is also important to note that "model parallelism" gets us into an accuracy
vs. performance tradeoff discussion. The SystemML optimizer/rewrites, in their
current state, make decisions to improve performance and do not change the
semantics of the input DML script. So, an imprecise language-level primitive
(for example, "minimize(layer)") might be required to support "model
parallelism".
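To make the tradeoff concrete, here is a small, purely illustrative Python sketch (not SystemML/DML code; both function names are hypothetical) of why relaxed synchronization changes semantics: updates computed from stale parameter copies follow a different trajectory than exact sequential SGD on the same data, which is exactly the kind of deviation a primitive like "minimize(layer)" would have to make explicit.

```python
# Illustrative only: exact sequential SGD vs. stale-parameter updates
# on the toy losses 0.5 * (w - g)^2 for a stream of targets g.

def sgd_sync(grads, lr=0.1):
    """Exact sequential SGD: each gradient is computed from the latest w."""
    w = 0.0
    for g in grads:
        w -= lr * (w - g)  # gradient of 0.5 * (w - g)^2 w.r.t. w
    return w

def sgd_stale(grads, lr=0.1, staleness=2):
    """Asynchronous-style update: gradients are evaluated on a stale copy
    of w, as happens when workers read parameters that lag behind."""
    w = 0.0
    history = [w]
    for g in grads:
        w_old = history[max(0, len(history) - 1 - staleness)]
        w -= lr * (w_old - g)
        history.append(w)
    return w

data = [1.0, 2.0, 3.0, 4.0]
print(sgd_sync(data), sgd_stale(data))  # different results from the same data
```

Both variants are plausible "training", but they do not compute the same answer, so an optimizer that silently switched between them would change program semantics.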

Another point: my guess is that the Yahoo ML team has added a parallel
MPI-based communication layer (for example, using MVAPICH2 to enable CUDA
5.0's GPUDirect RDMA) rather than relying on Spark's communication layer. I
understand this is necessary for distributed learning of deep networks (to
avoid paying the cost of communicating to/from the JVM). However, it is a
non-trivial addition to any system and needs detailed discussion (for
example: fault tolerance, permissions, homogeneous clusters, etc.) :)
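As a rough illustration of what such an out-of-band layer computes, here is a minimal single-process Python simulation of a ring allreduce, the bandwidth-optimal collective that gradient averaging is typically built on. This is a sketch under stated assumptions, not CaffeOnSpark's actual implementation; a real deployment would call MPI_Allreduce (or RDMA verbs) across separate processes, and it assumes each worker's gradient is pre-split into exactly n chunks (one scalar per chunk here, for brevity).

```python
# Single-process simulation of a ring allreduce over n workers.
# chunks[r][c] is rank r's current copy of chunk c of the gradient.

def ring_allreduce(grads):
    """Every rank ends up holding the element-wise sum of all gradients,
    using 2*(n-1) steps in which each rank sends one chunk to its neighbor."""
    n = len(grads)
    chunks = [list(g) for g in grads]

    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the complete
    # sum for chunk (r + 1) % n. Snapshot sends first to model simultaneity.
    for step in range(n - 1):
        sends = [((r + 1) % n, (r - step) % n, chunks[r][(r - step) % n])
                 for r in range(n)]
        for dst, c, val in sends:
            chunks[dst][c] += val

    # Phase 2, allgather: circulate the completed chunks for n-1 more steps
    # so every rank receives every fully reduced chunk.
    for step in range(n - 1):
        sends = [((r + 1) % n, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n])
                 for r in range(n)]
        for dst, c, val in sends:
            chunks[dst][c] = val
    return chunks

print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

The point of doing this over MPI/RDMA rather than Spark is that each step moves raw buffers directly between NICs (or GPUs, with GPUDirect), never serializing through the JVM heap.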


Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At

From:	Sourav Mazumder <>
Date:	02/25/2016 07:33 AM
Subject:	Using RDMA for Deep Learning


I was checking CaffeOnSpark for deep learning, which has just been open
sourced by Yahoo -


Wondering whether SystemML can also leverage the RDMA-based model
synchronization approach that CaffeOnSpark supports.

If not, can that be considered in the future roadmap?

