From: "Andrey (JIRA)"
To: dev@reef.apache.org
Date: Tue, 9 Aug 2016 17:13:20 +0000 (UTC)
Subject: [jira] [Created] (REEF-1511) timeout for Task Shutdown during IMRU recovery

Andrey created REEF-1511:
----------------------------

             Summary: timeout for Task Shutdown during IMRU recovery
                 Key: REEF-1511
                 URL: https://issues.apache.org/jira/browse/REEF-1511
             Project: REEF
          Issue Type: Improvement
          Components: IMRU
            Reporter: Andrey

This is related to the fault tolerance implementation in PR-1251.

Currently, the recovery logic in the IMRU driver waits for all tasks to reach a final state (failed or completed) before restarting the job (see the AreAllTasksInFinalState() check in the TryRecovery() method). We have seen the driver hang for a long time waiting for the last few tasks to finalize. Aborting tasks should be quick, so there is a bug there, but we can also add logic in the driver so that it does not wait for all tasks to complete. For instance: if 5% of tasks have not reported a final state within the expected period, release the corresponding evaluators and proceed with a new job retry.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
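
The proposed rule could be sketched roughly as follows. This is only an illustration in Python, not REEF code: `TaskState`, `RecoveryDecision`, `decide_recovery`, and the 60-second default timeout are all hypothetical names and values chosen for the example. The 5% figure comes from the message above, read here as "at most 5% of tasks are still not in a final state".

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskState(Enum):
    # Hypothetical task states; the real IMRU driver defines its own.
    RUNNING = "running"
    CLOSING = "closing"
    FAILED = "failed"
    COMPLETED = "completed"


FINAL_STATES = {TaskState.FAILED, TaskState.COMPLETED}


@dataclass
class RecoveryDecision:
    proceed: bool                                    # start the next job retry now?
    stragglers: list = field(default_factory=list)   # tasks whose evaluators to release


def decide_recovery(task_states, waited_seconds,
                    timeout_seconds=60.0, max_straggler_fraction=0.05):
    """Decide whether recovery may proceed without waiting for every task.

    task_states maps a task id to its TaskState. The timeout and the
    fraction are assumed tuning knobs, not existing REEF parameters.
    """
    stragglers = sorted(tid for tid, state in task_states.items()
                        if state not in FINAL_STATES)
    if not stragglers:
        # Current behaviour: all tasks are final, so recovery can start.
        return RecoveryDecision(True, [])

    timed_out = waited_seconds >= timeout_seconds
    few_enough = len(stragglers) <= max_straggler_fraction * len(task_states)
    if timed_out and few_enough:
        # Give up on the few remaining tasks; the caller releases their evaluators.
        return RecoveryDecision(True, stragglers)

    # Keep waiting: either the timeout has not expired or too many tasks are pending.
    return RecoveryDecision(False, [])
```

With 100 tasks of which 4 are still running, this sketch keeps waiting until the timeout expires, then returns those 4 task ids so that their evaluators can be released before the retry. If 10 tasks were still running, it would keep waiting even after the timeout, which leaves the larger hang to be handled as the separate bug mentioned above.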