ambari-issues mailing list archives

From "Alejandro Fernandez (JIRA)" <>
Subject [jira] [Created] (AMBARI-15446) Auto-retry on failure during RU/EU
Date Wed, 16 Mar 2016 20:21:33 GMT
Alejandro Fernandez created AMBARI-15446:

             Summary: Auto-retry on failure during RU/EU
                 Key: AMBARI-15446
             Project: Ambari
          Issue Type: Story
          Components: ambari-server
    Affects Versions: 2.4.0
            Reporter: Alejandro Fernandez
            Assignee: Alejandro Fernandez
             Fix For: 2.4.0

When a failure occurs during a Rolling Upgrade or Express Upgrade (RU/EU) and the task
transitions to HOLDING_FAILED or HOLDING_TIMEDOUT, Ambari should automatically retry it for
up to x minutes. This is useful when a host goes down while Ambari is running a task on it.
This will introduce one new parameter, e.g.,
stack-upgrade.max_retry_timeout_mins=15 (by default, it will not be present).
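
A minimal sketch of how the new parameter might be read, for illustration only. It assumes
a plain key=value properties text, and it assumes that an absent or non-positive value means
auto-retry is disabled (the description only says the key is absent by default). The helper
names parse_properties and max_retry_timeout_mins are hypothetical, not Ambari APIs.

```python
from typing import Dict, Optional

# Key taken from the issue description.
RETRY_KEY = "stack-upgrade.max_retry_timeout_mins"


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def max_retry_timeout_mins(props: Dict[str, str]) -> Optional[int]:
    """Return the retry window in minutes, or None when the key is absent.

    Assumption: a missing or non-positive value disables auto-retry.
    """
    value = props.get(RETRY_KEY)
    if value is None:
        return None
    minutes = int(value)
    return minutes if minutes > 0 else None
```

With the example value above, max_retry_timeout_mins(parse_properties(
"stack-upgrade.max_retry_timeout_mins=15")) yields 15; with the key missing it yields None.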

If Ambari Server is restarted, it should be able to recover.

Today, Action Scheduler increments attempt_count whenever a task is retried, but a retry
requires resetting start_time to -1. Because of this, we cannot rely on the start_time property
to know when to time out after several retries.

For the implementation, we will add another thread to Ambari that monitors failed tasks only
during an active RU/EU and changes their status back to PENDING so that Action Scheduler can
reschedule them. Luckily, any tasks in the HOLDING_TIMEDOUT and HOLDING_FAILED states are
blocking, so no other stages are allowed to proceed.

To know when a task was first started, we will add a new column, original_start_time, to the
host_role_command table.
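
The decision the monitoring thread would make can be sketched as follows. This is an
illustration under assumptions, not the actual ambari-server code: the Task class and the
functions should_retry and retry_if_eligible are hypothetical, timestamps are assumed to be
milliseconds since the epoch, and the retry window is measured from original_start_time
because start_time is reset to -1 on every retry.

```python
from dataclasses import dataclass
from typing import Optional

# States named in the issue description.
HOLDING_STATES = {"HOLDING_FAILED", "HOLDING_TIMEDOUT"}


@dataclass
class Task:
    """Hypothetical stand-in for a row of host_role_command."""
    status: str
    start_time: int           # ms since epoch; -1 once reset for a retry
    original_start_time: int  # ms since epoch of the very first start


def should_retry(task: Task, now_ms: int,
                 max_retry_timeout_mins: Optional[int]) -> bool:
    """True if the task is holding and still inside the retry window."""
    if max_retry_timeout_mins is None:
        return False  # assumption: parameter absent means no auto-retry
    if task.status not in HOLDING_STATES:
        return False
    elapsed_ms = now_ms - task.original_start_time
    return elapsed_ms < max_retry_timeout_mins * 60_000


def retry_if_eligible(task: Task, now_ms: int,
                      max_retry_timeout_mins: Optional[int]) -> bool:
    """Move an eligible task back to PENDING so it can be rescheduled."""
    if not should_retry(task, now_ms, max_retry_timeout_mins):
        return False
    task.status = "PENDING"
    task.start_time = -1  # reset required before Action Scheduler retries
    # original_start_time is deliberately left untouched.
    return True
```

Because original_start_time is never reset, the window stays anchored to the first attempt
regardless of how many times the scheduler retries the task.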

For the agents, we need to ensure that they always write out a response. On its first heartbeat,
an agent should send the status of its last command so that we know it failed and Ambari can retry.
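
One way the agent side could work, sketched under assumptions: the agent persists the status
of its last command to a local file, then includes it in the first heartbeat it sends. The
file layout, the heartbeat dictionary shape, and the function names save_last_status and
build_first_heartbeat are all hypothetical and do not reflect the real ambari-agent code.

```python
import json
import os
import tempfile


def save_last_status(path: str, status: dict) -> None:
    """Persist the last command status atomically so a crash cannot truncate it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(status, handle)
    os.replace(tmp_path, path)  # atomic rename over the old file


def build_first_heartbeat(path: str, hostname: str) -> dict:
    """Build the first heartbeat, attaching the last saved status if any."""
    reports = []
    if os.path.exists(path):
        with open(path) as handle:
            reports.append(json.load(handle))
    return {"hostname": hostname, "reports": reports}
```

Writing to a temporary file and renaming it keeps the saved status readable even if the host
goes down in the middle of a write, which is the failure case this issue is concerned with.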

This message was sent by Atlassian JIRA
