systemml-dev mailing list archives

From "Niketan Pansare" <npan...@us.ibm.com>
Subject Re: Using RDMA for Deep Learning
Date Thu, 25 Feb 2016 18:24:26 GMT

Hi Sourav,

RDMA-based model synchronization can be considered for our future roadmap.
Here are the two epics that this task depends on:
https://issues.apache.org/jira/browse/SYSTEMML-540
https://issues.apache.org/jira/browse/SYSTEMML-445

It is also important to note that "model parallelism" gets us into an accuracy
vs. performance tradeoff discussion. The SystemML optimizer/rewrites, in their
current state, make decisions to improve performance and do not
change the semantics of the input DML script. So, an imprecise language-level
primitive (for example "minimize(layer)") might be required to
support "model parallelism".

Another point: my guess is that the Yahoo ML team has added a parallel
MPI-based communication layer (for example, using MVAPICH2 to enable CUDA
5.0's GPUDirect RDMA) rather than relying on Spark's communication layer. I
understand this is necessary for distributed learning of deep networks (to
avoid paying the cost of communicating to/from the JVM). However, it is a
non-trivial addition to any system and needs detailed discussion (for
example, on fault tolerance, permissions, homogeneous clusters, etc.) :)

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar



From:	Sourav Mazumder <sourav.mazumder00@gmail.com>
To:	dev@systemml.incubator.apache.org
Date:	02/25/2016 07:33 AM
Subject:	Using RDMA for Deep Learning



Hi,

I was checking out CaffeOnSpark for deep learning, which Yahoo has just
open sourced:
http://yahoohadoop.tumblr.com/post/139916563586/caffeonspark-open-sourced-for-distributed-deep?soc_src=mail&soc_trk=ma

I am wondering whether SystemML can also leverage the RDMA-based model
synchronization approach that CaffeOnSpark uses.

If not, can that be considered for the future roadmap?

Regards,
Sourav

