Mailing-List: contact issues-help@ambari.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@ambari.apache.org
Date: Mon, 21 Mar 2016 18:56:25 +0000 (UTC)
From: "Alejandro Fernandez (JIRA)" <jira@apache.org>
To: issues@ambari.apache.org
Message-ID: <JIRA.12950980.1458159659000.8239.1458586585660@Atlassian.JIRA>
In-Reply-To: <JIRA.12950980.1458159659000@Atlassian.JIRA>
References: <JIRA.12950980.1458159659000@Atlassian.JIRA>
 <JIRA.12950980.1458159659714@arcas>
Subject: [jira] [Updated] (AMBARI-15446) Auto-retry on failure during RU/EU
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/AMBARI-15446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Fernandez updated AMBARI-15446:
-----------------------------------------
    Attachment: AMBARI-15446.trunk.patch

> Auto-retry on failure during RU/EU
> ----------------------------------
>
>                 Key: AMBARI-15446
>                 URL: https://issues.apache.org/jira/browse/AMBARI-15446
>             Project: Ambari
>          Issue Type: Story
>          Components: ambari-server
>    Affects Versions: 2.4.0
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>             Fix For: 2.4.0
>
>         Attachments: AMBARI-15446.trunk.patch
>
>
> When a failure occurs during a Rolling Upgrade (RU) or Express Upgrade (EU) and the task transitions to HOLDING_FAILED or HOLDING_TIMEDOUT, Ambari should automatically retry it for up to x minutes. This is useful when a host goes down while Ambari is running a task on it.
> ambari.properties will have one new parameter, e.g.:
> stack-upgrade.max_retry_timeout_mins=15 (not present by default)
> If Ambari Server is restarted, it should be able to recover.
> Today, Action Scheduler increments attempt_count whenever a task is retried, but a retry requires resetting start_time to -1. Because of this, we cannot rely on the start_time property to know when to time out after several retries.
> The implementation will add another thread to Ambari that monitors failed tasks only during an active RU/EU and changes their status back to PENDING so that Action Scheduler can reschedule them.
> Luckily, any tasks in HOLDING_TIMEDOUT and HOLDING_FAILED states are blocking, so no other stages are allowed to proceed.
> To know when a task was first started, we will add a new property, original_start_time, to the host_role_command table.
> The agents must always write out a response. On its first heartbeat, an agent should send the status of its last command so that Ambari knows the command failed and can retry it.
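
The retry decision described above can be sketched roughly as follows. This is only an illustrative Python sketch, not the code in the attached patch (Ambari Server itself is Java). The Task class and the should_retry and retry_if_eligible helpers are hypothetical names; only the statuses, start_time, attempt_count, original_start_time and the max_retry_timeout_mins setting come from the description. It assumes times are stored in milliseconds and that a missing property disables auto-retry.

```python
from dataclasses import dataclass
from typing import Optional

# Statuses named in the issue description.
HOLDING_FAILED = "HOLDING_FAILED"
HOLDING_TIMEDOUT = "HOLDING_TIMEDOUT"
PENDING = "PENDING"
RETRYABLE = {HOLDING_FAILED, HOLDING_TIMEDOUT}


@dataclass
class Task:
    """Hypothetical stand-in for a row in host_role_command."""
    status: str
    start_time: int           # assumed ms since epoch; reset to -1 on retry
    original_start_time: int  # assumed ms since epoch; set once, never reset
    attempt_count: int = 1


def should_retry(task: Task, now_ms: int,
                 max_retry_timeout_mins: Optional[int]) -> bool:
    """Return True if the task is still inside its retry window."""
    # Assumption: an absent or non-positive setting disables auto-retry.
    if max_retry_timeout_mins is None or max_retry_timeout_mins <= 0:
        return False
    if task.status not in RETRYABLE:
        return False
    if task.original_start_time < 0:
        return False  # the task was never started
    # Measure from original_start_time, because start_time is reset to -1
    # on every retry and cannot bound the total retry duration.
    elapsed_ms = now_ms - task.original_start_time
    return elapsed_ms < max_retry_timeout_mins * 60_000


def retry_if_eligible(task: Task, now_ms: int,
                      max_retry_timeout_mins: Optional[int]) -> bool:
    """Move an eligible task back to PENDING so it can be rescheduled."""
    if not should_retry(task, now_ms, max_retry_timeout_mins):
        return False
    task.status = PENDING
    task.start_time = -1  # attempt_count is left to the scheduler
    return True
```

Because the decision depends only on persisted fields (status and original_start_time) and the current time, a monitor built this way could resume after an Ambari Server restart without extra in-memory state.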

