Date: Mon, 21 Mar 2016 18:56:25 +0000 (UTC)
From: "Alejandro Fernandez (JIRA)"
To: issues@ambari.apache.org
Reply-To: dev@ambari.apache.org
Subject: [jira] [Updated] (AMBARI-15446) Auto-retry on failure during RU/EU

     [ https://issues.apache.org/jira/browse/AMBARI-15446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Fernandez updated AMBARI-15446:
-----------------------------------------
    Attachment: AMBARI-15446.trunk.patch

> Auto-retry on failure during RU/EU
> ----------------------------------
>
>                 Key: AMBARI-15446
>                 URL: https://issues.apache.org/jira/browse/AMBARI-15446
>             Project: Ambari
>          Issue Type: Story
>          Components: ambari-server
>    Affects Versions: 2.4.0
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>             Fix For: 2.4.0
>
>         Attachments: AMBARI-15446.trunk.patch
>
>
> When a failure occurs during RU/EU and a task transitions to HOLDING_FAILED or HOLDING_TIMEDOUT, Ambari should automatically retry it for up to x minutes. This is useful when a host goes down while Ambari is running a task on it.
> ambari.properties will have one new parameter, e.g.:
> stack-upgrade.max_retry_timeout_mins=15 (not present by default)
> If Ambari Server is restarted, it should be able to recover.
> Today, Action Scheduler increments attempt_count whenever a task is retried, but a retry requires resetting start_time to -1. Because of this, we cannot rely on the start_time property to know when to time out after several retries.
> For the implementation, we will add another thread to Ambari that monitors failed tasks, only during an active RU/EU, and changes their status back to PENDING so that Action Scheduler can reschedule them.
> Luckily, any tasks in the HOLDING_TIMEDOUT and HOLDING_FAILED states are blocking, so no other stages are allowed to proceed.
> In order to know when a task was first started, we will add a new property to the host_role_command table called original_start_time.
> For the agents, we need to ensure that they always write out a response. On the first heartbeat, an agent should send the status of its last command so that we know it failed and Ambari can retry.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
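
The retry decision described in the issue can be sketched roughly as follows. This is an illustrative Python sketch, not the code in AMBARI-15446.trunk.patch: the `Task` class, the helper names, and the millisecond timestamps are assumptions made for the example. Only the property name `stack-upgrade.max_retry_timeout_mins`, the status names, and the fields `start_time`, `attempt_count`, and `original_start_time` come from the issue text. The sketch also assumes the monitoring thread is the one that resets `start_time` to -1, which the issue does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional

# Status names taken from the issue description.
HOLDING_STATES = {"HOLDING_FAILED", "HOLDING_TIMEDOUT"}
RETRY_PROPERTY = "stack-upgrade.max_retry_timeout_mins"


@dataclass
class Task:
    """Hypothetical stand-in for a row of the host_role_command table."""
    status: str
    start_time: int           # ms since epoch; reset to -1 when rescheduled
    original_start_time: int  # ms since epoch of the first attempt; never reset
    attempt_count: int = 1    # incremented by Action Scheduler, not by this sketch


def parse_max_retry_mins(properties: dict) -> Optional[int]:
    """Return the retry window in minutes, or None if the key is absent."""
    raw = properties.get(RETRY_PROPERTY)
    return int(raw) if raw is not None else None


def should_retry(task: Task, now_ms: int, max_retry_mins: Optional[int]) -> bool:
    """True if the task is holding and still inside its retry window."""
    if max_retry_mins is None or max_retry_mins <= 0:
        return False  # property absent or non-positive: auto-retry disabled
    if task.status not in HOLDING_STATES:
        return False
    # Measure from original_start_time because start_time is reset on retry.
    elapsed_ms = now_ms - task.original_start_time
    return elapsed_ms < max_retry_mins * 60_000


def retry_if_eligible(task: Task, now_ms: int, max_retry_mins: Optional[int]) -> bool:
    """Move an eligible task back to PENDING so the scheduler can pick it up."""
    if not should_retry(task, now_ms, max_retry_mins):
        return False
    task.status = "PENDING"
    task.start_time = -1  # assumption: reset here rather than in the scheduler
    return True
```

Because `original_start_time` is stored with the task and never reset, the same check would still work after an Ambari Server restart, provided the value is read back from the database.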