ambari-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AMBARI-15446) Auto-retry on failure during RU/EU
Date Tue, 22 Mar 2016 00:44:25 GMT

    [ https://issues.apache.org/jira/browse/AMBARI-15446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205466#comment-15205466 ]

Hudson commented on AMBARI-15446:
---------------------------------

FAILURE: Integrated in Ambari-trunk-Commit #4525 (See [https://builds.apache.org/job/Ambari-trunk-Commit/4525/])
AMBARI-15446. Auto-retry on failure during RU/EU (alejandro) (afernandez: [http://git-wip-us.apache.org/repos/asf?p=ambari.git&a=commit&h=79b9e570fe1549bfc312dc02c29246c56043face])
* ambari-server/src/main/java/org/apache/ambari/server/orm/entities/HostRoleCommandEntity.java
* ambari-server/src/test/java/org/apache/ambari/server/state/services/RetryUpgradeActionServiceTest.java
* ambari-server/src/main/java/org/apache/ambari/server/configuration/Configuration.java
* ambari-server/src/main/resources/Ambari-DDL-Postgres-EMBEDDED-CREATE.sql
* ambari-server/src/main/resources/Ambari-DDL-MySQL-CREATE.sql
* ambari-server/src/main/resources/Ambari-DDL-SQLServer-CREATE.sql
* ambari-server/src/main/resources/Ambari-DDL-Oracle-CREATE.sql
* ambari-server/src/main/java/org/apache/ambari/server/topology/LogicalRequest.java
* ambari-server/src/main/java/org/apache/ambari/server/upgrade/UpgradeCatalog240.java
* ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java
* ambari-server/src/main/java/org/apache/ambari/server/topology/HostRequest.java
* ambari-server/src/main/java/org/apache/ambari/server/actionmanager/HostRoleCommand.java
* ambari-server/src/main/resources/Ambari-DDL-SQLAnywhere-CREATE.sql
* ambari-server/src/main/java/org/apache/ambari/server/agent/HeartbeatProcessor.java
* ambari-server/src/main/java/org/apache/ambari/server/state/services/RetryUpgradeActionService.java
* ambari-server/src/main/resources/Ambari-DDL-Postgres-CREATE.sql
* ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java


> Auto-retry on failure during RU/EU
> ----------------------------------
>
>                 Key: AMBARI-15446
>                 URL: https://issues.apache.org/jira/browse/AMBARI-15446
>             Project: Ambari
>          Issue Type: Story
>          Components: ambari-server
>    Affects Versions: 2.4.0
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>             Fix For: 2.4.0
>
>         Attachments: AMBARI-15446.trunk.patch
>
>
> When a failure occurs during a Rolling or Express Upgrade (RU/EU) and the task transitions to
> HOLDING_FAILED or HOLDING_TIMEDOUT, we want Ambari to retry it automatically for up to x minutes.
> This is useful when a host goes down while Ambari is running a task on it.
> ambari.properties will have one new parameter, e.g.:
> stack-upgrade.max_retry_timeout_mins=15 (not present by default)
> If Ambari Server is restarted, it should be able to recover.
> Today, Action Scheduler increments attempt_count whenever a task is retried, but each retry
> also requires resetting start_time to -1. Because of this, we cannot rely on start_time
> to know when to time out after several retries.
> For the implementation, we will add another thread to Ambari that monitors failed tasks
> only during an active RU/EU and changes their status back to PENDING so that Action Scheduler
> can reschedule them.
> Luckily, any tasks in the HOLDING_TIMEDOUT and HOLDING_FAILED states are blocking, so no
> other stages are allowed to proceed.
> To know when a task was first started, we will add a new column called original_start_time
> to the host_role_command table.
> For the agents, we need to ensure that they always write out a response. On the first
> heartbeat, each agent should send the status of its last command so that Ambari knows it
> failed and can retry it.
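
The retry window in the description comes from a single optional property in ambari.properties. Below is a minimal Java sketch of reading it, assuming that an absent, negative, or malformed value disables auto-retry. The class name `RetryConfigSketch` and that fallback behavior are hypothetical; this is not Ambari's actual `Configuration` API.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not Ambari's Configuration class.
final class RetryConfigSketch {
  static final String MAX_RETRY_TIMEOUT_KEY = "stack-upgrade.max_retry_timeout_mins";

  private RetryConfigSketch() {}

  /** Returns the retry window in minutes; 0 means auto-retry is disabled (assumed default). */
  static int maxRetryTimeoutMins(Properties props) {
    String raw = props.getProperty(MAX_RETRY_TIMEOUT_KEY);
    if (raw == null || raw.trim().isEmpty()) {
      return 0; // property absent (the stated default): feature off
    }
    try {
      return Math.max(Integer.parseInt(raw.trim()), 0); // negative values also disable it
    } catch (NumberFormatException e) {
      return 0; // malformed value: fall back to disabled rather than fail startup
    }
  }
}
```

Treating "not present" as "disabled" matches the note that the property is absent by default. Whether a malformed value should instead abort startup is a design choice this sketch does not settle.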

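The need for original_start_time can be shown with a small model of one host_role_command row. This sketch assumes millisecond timestamps and uses hypothetical names (`TaskTimingSketch`, `markStarted`, `resetForRetry`, `withinRetryWindow`). It only illustrates why start_time, which is reset to -1 on every retry, cannot anchor the timeout, while a write-once original_start_time can.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical model of the timing columns of one host_role_command row.
final class TaskTimingSketch {
  long startTime = -1;         // reset to -1 whenever the task is retried
  long originalStartTime = -1; // written once, on the very first start
  int attemptCount = 1;

  void markStarted(long nowMillis) {
    startTime = nowMillis;
    if (originalStartTime < 0) {
      originalStartTime = nowMillis; // later starts never overwrite this
    }
  }

  void resetForRetry() {
    attemptCount++; // the attempt count grows...
    startTime = -1; // ...but the current start time is lost
  }

  /** True while the first start is no older than the configured window. */
  boolean withinRetryWindow(long nowMillis, int maxRetryTimeoutMins) {
    if (maxRetryTimeoutMins <= 0 || originalStartTime < 0) {
      return false; // feature disabled, or task never started
    }
    return nowMillis - originalStartTime <= TimeUnit.MINUTES.toMillis(maxRetryTimeoutMins);
  }
}
```

Because the window is measured from the first start, a task that keeps failing stops being retried x minutes after it first ran, regardless of how many attempts have been made in between.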

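The monitoring thread proposed in the description could look roughly like the following sketch. Everything here is an assumption for illustration: the in-memory task list stands in for the database, `upgradeInProgress` stands in for however the server determines that an RU/EU is active, and the class and enum names are hypothetical. Because HOLDING_FAILED and HOLDING_TIMEDOUT block later stages, moving a task back to PENDING lets the scheduler re-run it before anything else advances.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical subset of task states, named after those in the issue.
enum SketchStatus { PENDING, IN_PROGRESS, HOLDING_FAILED, HOLDING_TIMEDOUT, COMPLETED }

// Hypothetical stand-in for a host_role_command row.
final class SketchTask {
  final long originalStartTime;
  volatile SketchStatus status;

  SketchTask(long originalStartTime, SketchStatus status) {
    this.originalStartTime = originalStartTime;
    this.status = status;
  }
}

final class RetryMonitorSketch {
  private final List<SketchTask> tasks;
  private final BooleanSupplier upgradeInProgress;
  private final long windowMillis;
  private ScheduledExecutorService executor;

  RetryMonitorSketch(List<SketchTask> tasks, BooleanSupplier upgradeInProgress, int maxRetryTimeoutMins) {
    this.tasks = tasks;
    this.upgradeInProgress = upgradeInProgress;
    this.windowMillis = TimeUnit.MINUTES.toMillis(Math.max(maxRetryTimeoutMins, 0));
  }

  /** One scan; returns how many tasks were moved back to PENDING. */
  int runOnce(long nowMillis) {
    if (windowMillis == 0 || !upgradeInProgress.getAsBoolean()) {
      return 0; // only act during an active RU/EU with the feature enabled
    }
    int retried = 0;
    for (SketchTask task : tasks) {
      boolean holding = task.status == SketchStatus.HOLDING_FAILED
          || task.status == SketchStatus.HOLDING_TIMEDOUT;
      if (holding && nowMillis - task.originalStartTime <= windowMillis) {
        task.status = SketchStatus.PENDING; // the scheduler can now dispatch it again
        retried++;
      }
    }
    return retried;
  }

  /** Runs runOnce on a background thread every periodSeconds. */
  synchronized void start(long periodSeconds) {
    if (executor != null) {
      return;
    }
    executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleWithFixedDelay(() -> {
      try {
        runOnce(System.currentTimeMillis());
      } catch (RuntimeException e) {
        // a real service would log this; catching it keeps one bad scan from cancelling later ones
      }
    }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
  }

  synchronized void stop() {
    if (executor != null) {
      executor.shutdownNow();
      executor = null;
    }
  }
}
```

Keeping the single pass in `runOnce` separate from the scheduling makes the retry rule testable without waiting on a timer. Since the check reads a persisted original start time rather than in-memory state, the same rule would still apply after an Ambari Server restart.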

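On the agent side, "always write out a response" amounts to persisting each command's outcome durably so that it survives a crash or host restart and can be reported on the first heartbeat. Ambari's agent is written in Python; the Java below is only a language-neutral sketch, and the file location, the `taskId,status` line format, and the class name `LastCommandStoreSketch` are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Optional;

// Hypothetical agent-side store; not Ambari agent code.
final class LastCommandStoreSketch {
  private final Path file;

  LastCommandStoreSketch(Path file) {
    this.file = file;
  }

  /** Persist the outcome as soon as the command finishes, e.g. "42,FAILED". */
  void record(long taskId, String status) throws IOException {
    Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
    Files.write(tmp, (taskId + "," + status).getBytes(StandardCharsets.UTF_8));
    // write-then-rename so a crash never leaves a half-written status file
    Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
  }

  /** Read back on startup so the first heartbeat can report the last command's status. */
  Optional<String> firstHeartbeatReport() throws IOException {
    if (!Files.exists(file)) {
      return Optional.empty(); // nothing ran before this start
    }
    String line = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
    return line.isEmpty() ? Optional.empty() : Optional.of(line);
  }
}
```

If the host goes down mid-task, the server would see either the recorded failure or no final report at all on the first heartbeat after restart; in both cases the task can end up in a holding state that the retry thread picks up.
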
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
