ambari-issues mailing list archives

From "Alejandro Fernandez (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AMBARI-15446) Auto-retry on failure during RU/EU
Date Wed, 16 Mar 2016 20:21:33 GMT
Alejandro Fernandez created AMBARI-15446:
--------------------------------------------

             Summary: Auto-retry on failure during RU/EU
                 Key: AMBARI-15446
                 URL: https://issues.apache.org/jira/browse/AMBARI-15446
             Project: Ambari
          Issue Type: Story
          Components: ambari-server
    Affects Versions: 2.4.0
            Reporter: Alejandro Fernandez
            Assignee: Alejandro Fernandez
             Fix For: 2.4.0


When a failure occurs during a Rolling Upgrade (RU) or Express Upgrade (EU) and the task
transitions to HOLDING_FAILED or HOLDING_TIMEDOUT, Ambari should automatically retry it for
up to x minutes. This is useful when a host goes down while Ambari is running a task on it.
ambari.properties will have one new parameter, e.g.:
stack-upgrade.max_retry_timeout_mins=15 (not present by default)
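
To illustrate the "not present by default" behaviour, here is a minimal sketch that reads the
property from ambari.properties-style text and treats a missing key as "auto-retry disabled".
The function name read_max_retry_timeout_mins and the simplified parsing are assumptions made
for illustration only; they are not Ambari code.

```python
from typing import Optional

PROPERTY_KEY = "stack-upgrade.max_retry_timeout_mins"


def read_max_retry_timeout_mins(properties_text: str) -> Optional[int]:
    """Return the retry window in minutes, or None when the key is absent.

    Hypothetical helper: None is taken to mean that auto-retry is disabled.
    """
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments, as in a Java-style .properties file.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == PROPERTY_KEY:
            return int(value.strip())
    return None
```

With this reading, a file that never mentions the key leaves today's behaviour unchanged.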

If Ambari Server is restarted, it should be able to recover.

Today, Action Scheduler increments attempt_count whenever a task is retried, but a retry
requires resetting start_time to -1. Because of this, we cannot rely on start_time to know
when to time out after several retries.
For the implementation, we will add another thread to Ambari that monitors failed tasks, only
while an RU/EU is active, and changes their status back to PENDING so that Action Scheduler
can reschedule them.
Luckily, any tasks in HOLDING_TIMEDOUT and HOLDING_FAILED states are blocking, so no other
stages are allowed to proceed.
To know when a task was first started, we will add a new column, original_start_time, to the
host_role_command table.
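
The retry decision described above could look roughly like the sketch below. It is an
illustration under stated assumptions, not the planned implementation: the Task class, the
maybe_retry function, and the use of epoch milliseconds are all hypothetical. Only the status
names, attempt_count, start_time, and original_start_time come from this description.

```python
from dataclasses import dataclass
from typing import Optional

HOLDING_STATES = {"HOLDING_FAILED", "HOLDING_TIMEDOUT"}


@dataclass
class Task:
    """Hypothetical stand-in for a row of host_role_command."""
    status: str
    start_time: int           # reset to -1 on every retry, so unusable as a deadline base
    original_start_time: int  # time of the very first attempt, never reset (assumed epoch ms)
    attempt_count: int = 1


def maybe_retry(task: Task, now_ms: int, max_retry_timeout_mins: Optional[int]) -> bool:
    """Flip a held task back to PENDING if it is still inside the retry window.

    Returns True when the task was handed back for rescheduling.
    """
    if max_retry_timeout_mins is None:
        return False  # property absent: auto-retry disabled
    if task.status not in HOLDING_STATES:
        return False  # only failed or timed-out tasks are candidates
    deadline_ms = task.original_start_time + max_retry_timeout_mins * 60_000
    if now_ms >= deadline_ms:
        return False  # window exhausted: leave the task held for manual action
    # attempt_count and start_time are left for Action Scheduler to update.
    task.status = "PENDING"
    return True
```

Because the deadline is computed from original_start_time rather than start_time, the window
does not restart each time Action Scheduler resets start_time to -1.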

For the agents, we need to ensure that they always write out a response. On its first heartbeat,
an agent should send the status of its last command so that Ambari knows it failed and can retry.
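
One way an agent could meet this requirement is to persist the outcome of each command and
replay it on the first heartbeat after a restart. The sketch below assumes a JSON file on disk
and made-up field names (taskId, status, reports, hostname); it does not reflect the actual
Ambari agent heartbeat format, and both function names are hypothetical.

```python
import json
import os


def save_last_command_status(path: str, task_id: int, status: str) -> None:
    """Persist the most recent command outcome so it survives an agent restart."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"taskId": task_id, "status": status}, f)
    # Rename into place so a crash never leaves a half-written status file.
    os.replace(tmp_path, path)


def build_first_heartbeat(path: str, hostname: str) -> dict:
    """Build the first heartbeat, including the last command status if one was saved."""
    heartbeat = {"hostname": hostname, "reports": []}
    if os.path.exists(path):
        with open(path) as f:
            heartbeat["reports"].append(json.load(f))
    return heartbeat
```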




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)