Date: Fri, 25 Mar 2016 20:11:25 +0000 (UTC)
From: "Markus Weimer (JIRA)"
To: dev@reef.apache.org
Subject: [jira] [Commented] (REEF-1223) IMRU Fault Tolerance - restart failed evaluators

    [ https://issues.apache.org/jira/browse/REEF-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212330#comment-15212330 ]

Markus Weimer commented on REEF-1223:
-------------------------------------

Yes, this, in a way, is "poor man's elasticity", right? It should work fine when we suffer from individual failures, but not if we suffer through cascades. If we drive the group communication setup cost to (almost) zero, this will approach elasticity in utility with none of the complexities involved.

> IMRU Fault Tolerance - restart failed evaluators
> ------------------------------------------------
>
> Key: REEF-1223
> URL: https://issues.apache.org/jira/browse/REEF-1223
> Project: REEF
> Issue Type: New Feature
> Components: IMRU, REEF.NET
> Reporter: Julia
> Assignee: Julia
>
> Currently, in the .NET Group Communication and IMRU scenario, if one of the Evaluators fails for whatever reason, all the Evaluators will be killed by the driver.
> There are multiple levels of fault tolerance. The scenario we would like to support in this JIRA is:
> * When an Evaluator fails, the failed Evaluator will be killed and the other good Evaluators will stay, but all the tasks running on those Evaluators will be stopped.
> * A new Evaluator will be requested and started with the original task.
> * The same tasks will be resubmitted to the rest of the Evaluators.
> * The topology of those tasks will be kept in the same group communication as before.
> * The data that has been downloaded to the good Evaluators will stay.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)