ambari-dev mailing list archives

From "Hudson (JIRA)" <>
Subject [jira] [Commented] (AMBARI-10606) Ambari Agent needs to retry failed install/start operations
Date Tue, 21 Apr 2015 23:04:58 GMT


Hudson commented on AMBARI-10606:

SUCCESS: Integrated in Ambari-trunk-Commit #2392 (See [])
Ambari-10606. Ambari Agent needs to retry failed install/start operations (smohanty:
* ambari-agent/src/main/python/ambari_agent/
* ambari-agent/src/main/python/ambari_agent/
* ambari-agent/src/test/python/ambari_agent/
* ambari-server/src/main/java/org/apache/ambari/server/agent/
* ambari-agent/src/test/python/ambari_agent/
* ambari-server/src/main/java/org/apache/ambari/server/configuration/
* ambari-server/src/main/java/org/apache/ambari/server/controller/
* ambari-server/src/main/java/org/apache/ambari/server/controller/

> Ambari Agent needs to retry failed install/start operations
> -----------------------------------------------------------
>                 Key: AMBARI-10606
>                 URL:
>             Project: Ambari
>          Issue Type: Task
>    Affects Versions: 2.0.0
>            Reporter: Sumit Mohanty
>            Assignee: Sumit Mohanty
>             Fix For: 2.1.0
>         Attachments: AMBARI-10606.patch
> With the changes to cluster provisioning in Ambari 2.1, each host is provisioned independently
in its own request. Additionally, users may make provisioning requests prior to hosts becoming
available. This means that components that connect to other components in the cluster may
start prior to the component that they are attempting to connect to. This connect behavior
is outside of Ambari proper and differs significantly between services/components.
> An example of this is HISTORY_SERVER which attempts to connect to NAMENODE and if it
fails to connect, it retries a couple of times and fails with a timeout after a small number
of seconds.
> As a result, the Ambari Agent in 2.1 needs to retry failed operations (especially start
operations). The retry timeout should be a significant amount of time and could be configurable.
This will allow hosts to join the cluster at different times without component connection
timeouts causing the request to "fail".
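
The retry behavior described above can be sketched roughly as follows. This is only an illustration of retrying within a configurable time window, not the code from the attached patch: the function name `run_with_retry` and the parameters `max_duration_secs` and `retry_delay_secs` are hypothetical and do not come from the Ambari Agent source.

```python
import time


def run_with_retry(operation, max_duration_secs=600, retry_delay_secs=10,
                   sleep=time.sleep, clock=time.monotonic):
    """Call operation() until it succeeds or the retry window is used up.

    All names here are hypothetical, chosen for illustration only.
    operation is any callable that returns True on success.
    Returns a tuple (succeeded, attempts).
    """
    deadline = clock() + max_duration_secs
    attempts = 0
    while True:
        attempts += 1
        if operation():
            return True, attempts
        # Give up if waiting for another attempt would pass the deadline.
        if clock() + retry_delay_secs > deadline:
            return False, attempts
        sleep(retry_delay_secs)
```

Bounding the retries by total elapsed time rather than by a fixed attempt count matches the idea that the timeout should be long and configurable, so a component can keep retrying until the component it depends on comes up on another host.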
> Currently when a timeout occurs, it doesn't affect other component operations but does
result in a "FAILED" response to the user and the user will need to manually start the failed

This message was sent by Atlassian JIRA
