Date: Thu, 22 Sep 2016 18:46:20 +0000 (UTC)
From: "Mariia Mykhailova (JIRA)"
To: dev@reef.apache.org
Reply-To: dev@reef.apache.org
Subject: [jira] [Commented] (REEF-1404) IMRU task state Maintenance and Preservation in Evaluator for fault tolerant

[ https://issues.apache.org/jira/browse/REEF-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15514139#comment-15514139 ]

Mariia Mykhailova commented on REEF-1404:
-----------------------------------------

The task of preserving the state of IMRU calculations and 
recovery from preserved state in case of a failure can be done on several levels:
1. Preserve state in memory, so that IMRU tasks restarted after evaluator failure within the same IMRU job can start calculations from some iteration instead of from scratch.
2. Preserve state in some persistent storage, so that a re-run of the whole IMRU job on the same data can use it.

We need to implement the first level soon.

Solution approaches:

I. *REEF layer does algorithm-agnostic preservation with no or little action required from user.*
The comments which mention {{UpdateResult}} and an API with GetState/AddState have this approach in mind, since this is the structure which holds the state of calculations in {{UpdateTaskHost}} on the REEF layer, and {{UpdateTaskHost}} would call these API methods. The easiest thing to do is to preserve {{UpdateResult}} in master context memory, so that a restarted {{UpdateTask}} has access to the latest state.
Pros:
1. As with the task/evaluator restart part of fault tolerance, the user doesn't need to write any code to benefit from state preservation.
Cons:
1. An algorithm can have a stateful {{UpdateFunction}}, and this approach will not preserve the proper state.
2. No possibility for smarter algorithm-agnostic preservation.
3. Lots of complexity when doing persistent preservation: the user has to configure storage, provide their credentials/security tokens to access it, etc.
4. Specific to the IMRU workflow.

II. *REEF layer gives user access to abstract entity bound to context, and user code handles everything related to state.*
The comments which mention an interface without any API have this approach in mind.
Pros:
1. Minimal change in REEF.
2. The user can create arbitrary checkpointing schemes depending on the algorithm and the storage available.
3. Generic enough to be used by workflows other than IMRU.
Cons:
1. All development cost of state handling is passed to the user.

III. 
Middle ground: *REEF layer deals with storage part of the task, and user deals with state part.*
The comments which mention codecs, serialization and templated APIs seem to mean this approach.
Pros:
1. Reduced development cost for the user.
Cons:
1. The user still can't benefit from checkpointing "for free".
2. Reduced flexibility in what schemes the user can implement.
3. Same as in approach I, lots of configuration complexity for storage.

Currently the discussion seems to favor approach II.

> IMRU task state Maintenance and Preservation in Evaluator for fault tolerant
> ----------------------------------------------------------------------------
>
> Key: REEF-1404
> URL: https://issues.apache.org/jira/browse/REEF-1404
> Project: REEF
> Issue Type: Task
> Reporter: Julia
> Labels: FT
>
> IMRU task should be able to
> * Maintain and preserve the state
> * When restarted, recover from the previous state

--
This message was sent by Atlassian JIRA (v6.3.4#6332)