systemml-issues mailing list archives

From "Matthias Boehm (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SYSTEMML-2087) Initial version of distributed spark backend
Date Sun, 13 May 2018 01:02:00 GMT

    [ https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473322#comment-16473322 ]

Matthias Boehm edited comment on SYSTEMML-2087 at 5/13/18 1:01 AM:
-------------------------------------------------------------------

Once we come closer to this task, it would be good to flesh out the details in terms of
subtasks. For example, we need to decide (1) how to distribute the data (for the different distribution
schemes) to the individual workers, (2) how to do the worker setup and cleanup (e.g., directories
for local evictions; most of this functionality can be reused from parfor, but it would be
good to clarify what exactly it entails), (3) how to implement the parameter exchange, and
(4) how to handle task failures and preemption. Regarding the latter, I would recommend
starting simple with something like "once a worker is brought up, it pulls the current state of
the model", with checkpointing done in a centralized manner.
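
The "start simple" recovery scheme can be illustrated with a minimal sketch. This is not
SystemML code: the classes ParameterServer and Worker, their method names, and the learning
rate are hypothetical, and the server runs in-process rather than remotely. It only shows a
worker pulling the current model state when it is brought up, with checkpoints kept in one
central place.

```python
class ParameterServer:
    """Hypothetical in-process stand-in for a remote parameter server."""

    def __init__(self, model):
        self.model = dict(model)   # parameter name -> value
        self.updates = 0           # number of applied updates
        self.checkpoint = None     # single, centrally held checkpoint

    def pull(self):
        # Return a copy so workers cannot mutate the server state directly.
        return dict(self.model)

    def push(self, gradients, lr=0.5):
        # Apply a plain gradient step for every pushed parameter.
        for name, grad in gradients.items():
            self.model[name] -= lr * grad
        self.updates += 1

    def save_checkpoint(self):
        # Centralized checkpointing: only the server persists state.
        self.checkpoint = (self.updates, dict(self.model))

    def restore_checkpoint(self):
        # Roll the server back to the last saved checkpoint.
        self.updates, saved = self.checkpoint
        self.model = dict(saved)


class Worker:
    """Hypothetical worker; holds no state that must survive a failure."""

    def __init__(self, server):
        self.server = server
        # Once the worker is brought up, it pulls the current model state.
        self.model = server.pull()

    def step(self, gradients):
        # Push local gradients, then refresh the local model copy.
        self.server.push(gradients)
        self.model = self.server.pull()
```

Under these assumptions, a failed or preempted worker is simply replaced by a new Worker,
which starts from whatever state the server currently holds; workers never write
checkpoints themselves.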


was (Author: mboehm7):
Once we come closer to this task, it would be good to flesh out the details in terms of
subtasks. For example, we need to decide (1) how to distribute the data (for the different distribution
schemes) to the individual workers, (2) how to implement the parameter exchange, and (3) how
to handle task failures and preemption. Regarding the latter, I would recommend starting simple
with something like once a worker is brought up, it pulls the current state of the model, with
checkpointing done in a centralized manner.

> Initial version of distributed spark backend
> --------------------------------------------
>
>                 Key: SYSTEMML-2087
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2087
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>
> This part aims to implement BSP (bulk synchronous parallel) for the distributed Spark backend. The idea is
to be able to launch a remote parameter server and the workers.
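
For readers unfamiliar with BSP, the following is a minimal sketch of one synchronous
round, simulated sequentially in a single process. It is not the SystemML implementation:
the function names bsp_round and squared_error_grad, the scalar model, and the learning
rate are all assumptions made for illustration. The point is that every worker computes
its gradient against the same model snapshot, and the model is updated only once all
gradients have arrived.

```python
def squared_error_grad(w, partition):
    # Hypothetical per-worker gradient: derivative of 0.5 * (w - x)^2,
    # averaged over the worker's data partition.
    return sum(w - x for x in partition) / len(partition)


def bsp_round(model, partitions, grad_fn, lr=0.5):
    # Every worker computes its gradient against the same model snapshot.
    grads = [grad_fn(model, part) for part in partitions]
    # Barrier: the single update below happens only after all workers
    # have reported, so no worker ever sees a partially updated model.
    mean_grad = sum(grads) / len(grads)
    return model - lr * mean_grad


def train(model, partitions, rounds, grad_fn=squared_error_grad):
    # Repeat synchronous rounds; each round is one global model update.
    for _ in range(rounds):
        model = bsp_round(model, partitions, grad_fn)
    return model
```

In a distributed setting the list comprehension would be replaced by remote workers and
the barrier by the parameter server waiting for all of them, but the update rule per
round stays the same.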



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
