Hi Sourav,

RDMA-based model synchronization can be considered for our future roadmap. Here are two epics that this task depends on:
https://issues.apache.org/jira/browse/SYSTEMML-540
https://issues.apache.org/jira/browse/SYSTEMML-445

It is also important to note that "model parallelism" gets us into an accuracy vs. performance tradeoff discussion. In its current state, the SystemML optimizer (with its rewrites) makes decisions to improve performance and does not change the semantics of the input DML script. So an imprecise language-level primitive (for example, "minimize(layer)") might be required to support "model parallelism".

Another point: my guess is that the Yahoo ML team has added a parallel MPI-based communication layer (for example, using MVAPICH2 to enable CUDA 5.0's GPUDirect RDMA) rather than relying on Spark's communication layer. I understand this is necessary for distributed learning of deep networks (to avoid paying the cost of communicating to/from the JVM). However, it is a non-trivial addition to any system and needs detailed discussion (for example, on fault tolerance, permissions, homogeneous clusters, etc.) :)

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar

From: Sourav Mazumder <sourav.mazumder00@gmail.com>
To: dev@systemml.incubator.apache.org
Date: 02/25/2016 07:33 AM
Subject: Using RDMA for Deep Learning





Hi,

I was checking out CaffeOnSpark for Deep Learning, which has just been open sourced
by Yahoo -
http://yahoohadoop.tumblr.com/post/139916563586/caffeonspark-open-sourced-for-distributed-deep?soc_src=mail&soc_trk=ma
.

I am wondering whether SystemML can also leverage the RDMA-based model
synchronization approach that CaffeOnSpark uses.

If not, can that be considered for the future roadmap?

Regards,
Sourav