systemml-issues mailing list archives

From "Matthias Boehm (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SYSTEMML-2349) Local worker error handling
Date Wed, 30 May 2018 21:14:00 GMT

    [ https://issues.apache.org/jira/browse/SYSTEMML-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16495685#comment-16495685 ]

Matthias Boehm edited comment on SYSTEMML-2349 at 5/30/18 9:13 PM:
-------------------------------------------------------------------

Well, since both workers and aggregator execute user-defined functions, we need to make the
threads robust enough to handle errors and ensure termination. One approach is
a callback mechanism: (1) keep the threads as members of the invoking instruction, (2) pass
the instruction as an argument to the workers and aggregator, (3) wrap the invocation of
user-defined functions in try-catch, and (4) call a proper shutdown method of the instruction
(with access to the thread members) from the respective catch clauses.
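
The four steps above could be sketched roughly as follows. This is only a
minimal illustration, not SystemML's actual code: the names InstructionSketch,
UserFunction, Worker, addWorker, runAll, and shutdown are all hypothetical, and
shutdown is assumed to simply record the first error and interrupt every thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the invoking instruction (not a SystemML class).
class InstructionSketch {
    // Hypothetical user-defined function that may fail with any exception.
    interface UserFunction { void call() throws Exception; }

    // (1) The threads are members, so shutdown() can reach all of them.
    private final List<Thread> threads = new ArrayList<>();
    private final AtomicReference<Throwable> firstError = new AtomicReference<>();

    void addWorker(String name, UserFunction udf) {
        // (2) The instruction itself is passed into the worker.
        threads.add(new Thread(new Worker(this, udf), name));
    }

    // (4) Shutdown callback: remember the first error, interrupt all threads.
    void shutdown(Throwable cause) {
        firstError.compareAndSet(null, cause);
        for (Thread t : threads) t.interrupt();
    }

    // Starts all threads, waits for them, and returns the first error (or null).
    Throwable runAll() {
        for (Thread t : threads) t.start();
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            shutdown(e);
            Thread.currentThread().interrupt();
        }
        return firstError.get();
    }

    static class Worker implements Runnable {
        private final InstructionSketch inst;
        private final UserFunction udf;

        Worker(InstructionSketch inst, UserFunction udf) {
            this.inst = inst;
            this.udf = udf;
        }

        @Override
        public void run() {
            // Skip the work if a sibling failed before this thread started.
            if (inst.firstError.get() != null) return;
            try {
                udf.call();            // (3) user-defined function inside try-catch
            } catch (Throwable e) {
                inst.shutdown(e);      // (4) callback into the instruction
            }
        }
    }
}
```

With this shape, a worker blocked in an interruptible call (for example a sleep
or a blocking queue operation) is woken up as soon as any other worker fails,
so runAll returns instead of hanging. The same Worker wrapper would apply to
the aggregator thread.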

Also, once you have reworked the queuing logic, we might be able to avoid running the aggregator
as a thread. I could imagine a design where we simply call execute on the parameter server,
and this parameter server internally spawns its workers and returns as soon as all workers
are done or an error has occurred.
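
One way such a blocking execute call might look is sketched below. It is an
assumption-laden illustration rather than the actual design: the class
LocalParamServerSketch and its execute method are hypothetical, workers are
modeled as plain Callable<Void> tasks, the worker list is assumed to be
non-empty, and the standard java.util.concurrent executor classes stand in for
whatever thread management the parameter server would really use.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; names are not part of SystemML's API.
class LocalParamServerSketch {
    // Spawns one thread per worker and blocks until all workers are done
    // or the first worker fails. Assumes a non-empty worker list.
    static void execute(List<Callable<Void>> workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers.size());
        ExecutorCompletionService<Void> done = new ExecutorCompletionService<>(pool);
        try {
            for (Callable<Void> w : workers) {
                done.submit(w);
            }
            // take() hands back futures in completion order, so a failure is
            // seen as soon as it happens, not only after earlier workers end.
            for (int i = 0; i < workers.size(); i++) {
                done.take().get(); // throws ExecutionException if the worker failed
            }
        } catch (ExecutionException e) {
            throw new RuntimeException("worker failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while waiting for workers", e);
        } finally {
            pool.shutdownNow(); // interrupts workers that are still running
        }
    }
}
```

In this variant the caller needs no aggregator thread of its own: it either
gets a normal return once every worker has finished, or an exception carrying
the first worker error as its cause.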



was (Author: mboehm7):
Well, since both workers and aggregator execute user-defined functions we need to make the
threads robust enough the handle errors and ensure termination. One approach to do that is
to use a callback as follows: (1) keep the threads as members in the invoking instruction,
(2) pass the instruction as an argument into the workers and aggregator, (3) wrap the invocation
of user-defined functions into try-catch, and (4) call a proper shutdown method of the instruction
(with access to the thread members) from the respective catch clauses.

Also, once you reworked the queuing logic, we might be able to avoid running the aggregator
as a thread. I could imagine a design where we simply call execute on the parameter server,
and this parameter server internally spawns its workers and simply returns whenever all workers
are done or an error occurred.


> Local worker error handling
> ---------------------------
>
>                 Key: SYSTEMML-2349
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2349
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>
> While playing around with the locking scheme of the parameter server, I encountered unrelated
> errors that led to the parameter server hanging. We need to make sure all worker errors are
> correctly propagated so that we can guarantee termination.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
