reef-dev mailing list archives

From "Mariia Mykhailova (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1404) IMRU task state Maintenance and Preservation in Evaluator for fault tolerant
Date Thu, 22 Sep 2016 18:46:20 GMT

    [ https://issues.apache.org/jira/browse/REEF-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15514139#comment-15514139
] 

Mariia Mykhailova commented on REEF-1404:
-----------------------------------------

Preserving the state of IMRU calculations and recovering from the preserved state after
a failure can be done on several levels:

1. Preserve state in memory, so that IMRU tasks restarted after evaluator failure within the
same IMRU job can start calculations from some iteration instead of from scratch.
2. Preserve state in some persistent storage, so that a re-run of the whole IMRU job on the
same data can use it.

We need to implement the first level soon.

Solution approaches:

I. *REEF layer does algorithm-agnostic preservation with little or no action required from
the user.*
The comments that mention {{UpdateResult}} and an API with GetState/AddState have this approach
in mind: {{UpdateResult}} is the structure that holds the state of calculations in {{UpdateTaskHost}}
on the REEF layer, and {{UpdateTaskHost}} would call these API methods. The easiest thing to do
is to preserve {{UpdateResult}} in master context memory, so that a restarted {{UpdateTask}} has
access to the latest state.

Pros: 
1. As with the task/evaluator restart part of fault tolerance, the user doesn't need to write
any code to benefit from state preservation.
Cons: 
1. An algorithm can have a stateful {{UpdateFunction}}, and this approach will not preserve
its state properly.
2. No possibility for smarter, algorithm-specific preservation.
3. Lots of complexity when doing persistent preservation: the user has to configure storage,
provide credentials/security tokens to access it, etc.
4. Specific to IMRU workflow.
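
For illustration only, approach I can be sketched in a few lines. This is a language-neutral
sketch written in Python, not REEF code: {{MasterContextStore}}, {{run_update_task}} and
{{TaskFailure}} are hypothetical names, and the snake_case {{add_state}}/{{get_state}} methods
only stand in for the AddState/GetState idea mentioned in the comments. The point it shows is
that the framework loop, not user code, records the latest result after each iteration.

```python
class TaskFailure(Exception):
    """Simulated task/evaluator failure."""


class MasterContextStore:
    """Hypothetical store that outlives a task but lives in context memory."""

    def __init__(self):
        self._state = None

    def add_state(self, state):
        self._state = state

    def get_state(self):
        return self._state


def run_update_task(store, map_outputs, update, initial, fail_at=None):
    """Hypothetical stand-in for the update task's iteration loop."""
    saved = store.get_state()
    # Resume from the preserved (iteration, result) pair if there is one.
    iteration, result = saved if saved is not None else (0, initial)
    while iteration < len(map_outputs):
        if iteration == fail_at:
            raise TaskFailure(f"failed before iteration {iteration}")
        result = update(result, map_outputs[iteration])
        iteration += 1
        # The framework, not the user, records the latest result.
        store.add_state((iteration, result))
    return result


def add(acc, x):
    """Stateless update function: everything it needs is in the result."""
    return acc + x


store = MasterContextStore()
map_outputs = [1, 2, 3, 4]

try:
    run_update_task(store, map_outputs, add, 0, fail_at=2)
except TaskFailure:
    pass  # the task dies, the store survives

resumed_from = store.get_state()
final = run_update_task(store, map_outputs, add, 0)
```

The sketch works because {{add}} is stateless. If the update function kept internal fields,
they would not be part of the stored pair, which is con 1 above.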

II. *REEF layer gives the user access to an abstract entity bound to the context, and user code
handles everything related to state.*
The comments that mention an interface without any API have this approach in mind.

Pros:
1. Minimal change in REEF.
2. User can create arbitrary checkpointing schemes depending on algorithm and storage available.
3. Generic enough to be used by workflows other than IMRU.
Cons:
1. All development cost of state handling is passed to the user.
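
A minimal sketch of approach II, again in Python and with hypothetical names
({{ContextBoundState}}, {{RunningMeanUpdate}}); nothing here is an existing REEF type. The
framework only hands out an opaque slot bound to the context, and the user-written update
function decides what to store in it and how to restore from it.

```python
class ContextBoundState:
    """Hypothetical opaque slot bound to a context; the framework never inspects it."""

    def __init__(self):
        self.value = None


class RunningMeanUpdate:
    """User-written stateful update function that owns its own checkpointing."""

    def __init__(self, slot):
        self.slot = slot
        saved = slot.value
        if saved is None:
            self.count, self.total = 0, 0.0
        else:
            # Restore scheme is entirely up to user code.
            self.count, self.total = saved["count"], saved["total"]

    def update(self, values):
        self.count += len(values)
        self.total += sum(values)
        # User code chooses what to checkpoint and when.
        self.slot.value = {"count": self.count, "total": self.total}
        return self.total / self.count


slot = ContextBoundState()

first_task = RunningMeanUpdate(slot)
mean_before = first_task.update([1.0, 3.0])

# Simulated restart: a new task instance receives the same context-bound slot.
restarted_task = RunningMeanUpdate(slot)
mean_after = restarted_task.update([5.0])
```

Here the internal fields of the update function survive the restart, but only because the
user wrote the save and restore logic, which is the cost listed above.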

III. Middle ground: *REEF layer deals with the storage part of the task, and the user deals with
the state part.*
The comments that mention codecs, serialization and templated APIs seem to have this approach in mind.

Pros:
1. Reduced development cost for user.
Cons:
1. User still can't benefit from checkpointing "for free".
2. Reduced flexibility in what schemes the user can implement.
3. As in approach I, lots of configuration complexity for storage.
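
A rough sketch of the split in approach III, assuming a codec-style contract. {{Codec}},
{{FrameworkCheckpointer}} and the JSON encoding are all assumptions made for illustration and
do not correspond to an existing REEF API. The framework side owns the storage and sees only
bytes; the user side supplies the state and the way to serialize it.

```python
import json
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class Codec(Generic[T]):
    """User-supplied pair of functions turning state into bytes and back."""

    def __init__(self, encode: Callable[[T], bytes], decode: Callable[[bytes], T]):
        self.encode = encode
        self.decode = decode


class FrameworkCheckpointer(Generic[T]):
    """Hypothetical framework-side component: owns storage, knows only bytes."""

    def __init__(self, codec: Codec[T]):
        self._codec = codec
        # Stand-in for whatever storage the framework would configure.
        self._blob: Optional[bytes] = None

    def save(self, state: T) -> None:
        self._blob = self._codec.encode(state)

    def load(self) -> Optional[T]:
        if self._blob is None:
            return None
        return self._codec.decode(self._blob)


# User side: define the state and how to serialize it.
json_codec = Codec(
    encode=lambda state: json.dumps(state).encode("utf-8"),
    decode=lambda blob: json.loads(blob.decode("utf-8")),
)

checkpointer = FrameworkCheckpointer(json_codec)
empty = checkpointer.load()

state = {"iteration": 3, "weights": [0.5, 1.5]}
checkpointer.save(state)
restored = checkpointer.load()
```

The user writes less code than in approach II, but can only implement schemes that fit the
save/load contract the framework exposes.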

Currently the discussion seems to favor approach II.

> IMRU task state Maintenance and Preservation in Evaluator for fault tolerant
> ----------------------------------------------------------------------------
>
>                 Key: REEF-1404
>                 URL: https://issues.apache.org/jira/browse/REEF-1404
>             Project: REEF
>          Issue Type: Task
>            Reporter: Julia
>              Labels: FT
>
> IMRU task should be able to
> * Maintain and preserve the state
> * When restarted, recover from the previous state



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
