Mailing-List: contact dev-help@reef.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@reef.apache.org
Date: Thu, 22 Sep 2016 18:46:20 +0000 (UTC)
From: "Mariia Mykhailova (JIRA)" <jira@apache.org>
To: dev@reef.apache.org
Message-ID: <JIRA.12972262.1464056015000.641207.1474569980613@Atlassian.JIRA>
In-Reply-To: <JIRA.12972262.1464056015000@Atlassian.JIRA>
References: <JIRA.12972262.1464056015000@Atlassian.JIRA> <JIRA.12972262.1464056015047@arcas>
Subject: [jira] [Commented] (REEF-1404) IMRU task state Maintenance and
 Preservation in Evaluator for fault tolerant
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Thu, 22 Sep 2016 18:46:22 -0000


    [ https://issues.apache.org/jira/browse/REEF-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15514139#comment-15514139 ] 

Mariia Mykhailova commented on REEF-1404:
-----------------------------------------

Preserving the state of IMRU calculations, and recovering from the preserved state after a failure, can be done on two levels:

1. Preserve state in memory, so that IMRU tasks restarted after an evaluator failure within the same IMRU job can resume from some iteration instead of starting from scratch.
2. Preserve state in some persistent storage, so that a re-run of the whole IMRU job on the same data can use it.

We need to implement the first level soon.

Solution approaches:

I. *REEF layer does algorithm-agnostic preservation with little or no action required from the user.*
The comments that mention {{UpdateResult}} and an API with GetState/AddState have this approach in mind: {{UpdateResult}} is the structure that holds the state of calculations in {{UpdateTaskHost}} on the REEF layer, and {{UpdateTaskHost}} would call these API methods. The easiest thing to do is to preserve {{UpdateResult}} in master context memory, so that a restarted {{UpdateTask}} has access to the latest state.

Pros: 
1. As with the task/evaluator restart part of fault tolerance, the user doesn't need to write any code to benefit from state preservation.
Cons: 
1. An algorithm can have a stateful {{UpdateFunction}}, in which case preserving only {{UpdateResult}} will not capture the full state.
2. Because preservation is algorithm-agnostic, there is no possibility for smarter, algorithm-aware schemes.
3. Lots of complexity when doing persistent preservation: the user has to configure storage, provide credentials/security tokens to access it, etc.
4. Specific to IMRU workflow.
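
A minimal sketch of approach I, to make the restart behaviour concrete. It is written in Python for brevity rather than against the REEF code base, and every name in it ({{ContextStateStore}}, {{run_update_task}}, {{TaskFailure}}) is hypothetical. The only ideas taken from the discussion are a GetState/AddState style API and the host saving the latest result in memory that outlives the task.

```python
# Sketch of approach I. All names here are hypothetical, not REEF APIs.


class TaskFailure(Exception):
    """Simulated task/evaluator failure."""


class ContextStateStore:
    """Stands in for master context memory, which outlives a single task."""

    def __init__(self):
        self._state = None

    def add_state(self, state):
        self._state = state

    def get_state(self):
        return self._state


def run_update_task(store, update, initial, inputs, fail_before=None):
    """Plays the role of the update task host.

    Runs the update loop and saves (next_iteration, result) after every
    iteration, with no involvement from the user-supplied update function.
    """
    saved = store.get_state()
    start, result = saved if saved is not None else (0, initial)
    for i in range(start, len(inputs)):
        if i == fail_before:
            raise TaskFailure(f"simulated failure before iteration {i}")
        result = update(result, inputs[i])
        store.add_state((i + 1, result))
    return result
```

Calling {{run_update_task}} a second time with the same store resumes from the saved iteration instead of iteration 0. Anything the {{update}} callable keeps internally (a counter in a closure, for example) is not saved, which is the situation con 1 describes.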

II. *REEF layer gives the user access to an abstract entity bound to the context, and user code handles everything related to state.*
The comments that mention an interface without any API have this approach in mind.

Pros:
1. Minimal change in REEF.
2. User can create arbitrary checkpointing schemes depending on algorithm and storage available.
3. Generic enough to be used by workflows other than IMRU.
Cons:
1. All development cost of state handling is passed to user.
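
A minimal sketch of approach II, again in Python and with hypothetical names only ({{Context}}, {{MyCheckpoint}}, {{MyTask}}). The framework side is reduced to a single opaque slot on the context; the checkpoint layout and the decision to checkpoint every second iteration are both user code, chosen here purely for illustration.

```python
# Sketch of approach II. All names here are hypothetical, not REEF APIs.
from dataclasses import dataclass, field


class Context:
    """Framework side: holds one opaque user object and never inspects it."""

    def __init__(self):
        self.task_state = None


@dataclass
class MyCheckpoint:
    """User side: the user decides what a checkpoint contains."""
    iteration: int = 0
    weights: list = field(default_factory=lambda: [0.0])


class MyTask:
    """User side: the user decides when to checkpoint and how to recover."""
    CHECKPOINT_EVERY = 2

    def __init__(self, context):
        if context.task_state is None:
            context.task_state = MyCheckpoint()
        checkpoint = context.task_state
        self.context = context
        self.iteration = checkpoint.iteration
        self.weights = list(checkpoint.weights)

    def iterate(self, gradient):
        self.weights = [w - g for w, g in zip(self.weights, gradient)]
        self.iteration += 1
        if self.iteration % self.CHECKPOINT_EVERY == 0:
            # Store a copy so later iterations cannot mutate the checkpoint.
            self.context.task_state = MyCheckpoint(
                self.iteration, list(self.weights))
```

A task constructed on a context that already carries a checkpoint starts from that checkpoint; work done since the last checkpoint is lost, which is a trade-off the user controls in this approach.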

III. Middle ground: *REEF layer deals with the storage part of the task, and the user deals with the state part.*
The comments that mention codecs, serialization and templated APIs seem to mean this approach.

Pros:
1. Reduced development cost for user.
Cons:
1. User still can't benefit from checkpointing "for free".
2. Reduced flexibility of what schemes user can implement.
3. Same as in approach I, lots of configuration complexity for storage.
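
A minimal sketch of approach III, in Python with hypothetical names ({{Codec}}, {{InMemoryBytesStorage}}, {{StateManager}}, {{JsonCodec}}). The assumed split is that the framework only moves bytes to and from some storage, while the user supplies a codec that turns their state into bytes and back. JSON is used as the example encoding only because it is in the standard library.

```python
# Sketch of approach III. All names here are hypothetical, not REEF APIs.
import json
from typing import Generic, Optional, Protocol, TypeVar

T = TypeVar("T")


class Codec(Protocol[T]):
    """User side: how to turn state into bytes and back."""

    def encode(self, state: T) -> bytes: ...

    def decode(self, data: bytes) -> T: ...


class InMemoryBytesStorage:
    """Framework side: stores bytes by key and knows nothing about state."""

    def __init__(self):
        self._blobs = {}

    def write(self, key: str, data: bytes) -> None:
        self._blobs[key] = bytes(data)

    def read(self, key: str) -> Optional[bytes]:
        return self._blobs.get(key)


class StateManager(Generic[T]):
    """Framework side: combines a storage backend with a user codec."""

    def __init__(self, storage, codec: Codec[T], key: str):
        self._storage = storage
        self._codec = codec
        self._key = key

    def save(self, state: T) -> None:
        self._storage.write(self._key, self._codec.encode(state))

    def load(self) -> Optional[T]:
        data = self._storage.read(self._key)
        return None if data is None else self._codec.decode(data)


class JsonCodec:
    """User side: example codec for state made of plain JSON types."""

    def encode(self, state) -> bytes:
        return json.dumps(state, sort_keys=True).encode("utf-8")

    def decode(self, data: bytes):
        return json.loads(data.decode("utf-8"))
```

Swapping {{InMemoryBytesStorage}} for a persistent backend would not change the user's codec, but it is also where the storage configuration mentioned in con 3 would have to go.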

Currently the discussion seems to favor approach II.

> IMRU task state Maintenance and Preservation in Evaluator for fault tolerant
> ----------------------------------------------------------------------------
>
>                 Key: REEF-1404
>                 URL: https://issues.apache.org/jira/browse/REEF-1404
>             Project: REEF
>          Issue Type: Task
>            Reporter: Julia
>              Labels: FT
>
> IMRU task should be able to 
> * Maintain and preserve the state
> * When restarted, recover from the previous state

