ambari-dev mailing list archives

From "Jaimin D Jetly (JIRA)" <>
Subject [jira] [Commented] (AMBARI-4530) Cluster install errors out strangely without starting services
Date Sat, 22 Feb 2014 03:36:19 GMT


Jaimin D Jetly commented on AMBARI-4530:

Patch committed to trunk.

> Cluster install errors out strangely without starting services
> --------------------------------------------------------------
>                 Key: AMBARI-4530
>                 URL:
>             Project: Ambari
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.4.4
>            Reporter: Jaimin D Jetly
>            Assignee: Jaimin D Jetly
>             Fix For: 1.5.0
>         Attachments: AMBARI-4530.patch, AMBARI-4530_2.patch, Solution-1.png, Solution-2.png,
>
> On a two-host cluster, one of the agents was down.
> The first INSTALL attempt fails because tasks for the down agent time out and get aborted.
> When INSTALL is retried, no tasks are created for that host (its agent is down, so the host is in HEARTBEAT_LOST state).
> {noformat}
> 06:38:55,649  INFO [qtp593591875-22] AmbariManagementControllerImpl:1147 - Command is not created for servicecomponenthost , clusterName=c1, clusterId=2, serviceName=HBASE, componentName=HBASE_MASTER,, hostState=HEARTBEAT_LOST, targetNewState=INSTALLED
> {noformat}
> However, some tasks are created for the other agent, and those succeed. At this point, the front end (FE) assumes that the install succeeded and issues a START all. That results in the state-change errors seen in the log.
> _The FE assumption is based on the fact that all of the tasks that were created succeeded._
> {noformat}
> 06:40:04,488 ERROR [qtp593591875-19] AbstractResourceProvider:302 - Caught AmbariException when modifying a resource
> org.apache.ambari.server.AmbariException: Invalid transition for servicecomponenthost, clusterName=c1, clusterId=2, serviceName=ZOOKEEPER, componentName=ZOOKEEPER_SERVER,, currentState=INSTALL_FAILED, newDesiredState=STARTED
> {noformat}
> We should discuss possible solutions. One solution could be to have the FE not issue a START if any master components are in INSTALL_FAILED state. In addition, showing which hosts are in HEARTBEAT_LOST state would help the user debug the situation. Another option is to have the back end (BE) indicate that tasks were not created for some host(s). In any case, when a host is down, the user needs a way to get out of the install wizard.
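
The first option in the description above can be sketched roughly as follows. This is only an illustration in Python, not the committed patch and not Ambari's web-client code: the `HostComponent` record, its field names, the `"HEALTHY"` host state, the host names, and the two helper functions are all hypothetical. Only the state names INSTALL_FAILED, INSTALLED, and HEARTBEAT_LOST come from the logs quoted above. Treating both components as masters is likewise an assumption made for the example.

```python
from dataclasses import dataclass


# Hypothetical record; the field names are illustrative, not Ambari's API.
@dataclass(frozen=True)
class HostComponent:
    host: str
    component: str
    is_master: bool
    state: str       # e.g. "INSTALLED" or "INSTALL_FAILED"
    host_state: str  # e.g. "HEALTHY" (assumed name) or "HEARTBEAT_LOST"


def install_blockers(components):
    """Return (failed master components, hosts that lost heartbeat)."""
    failed_masters = [c for c in components
                      if c.is_master and c.state == "INSTALL_FAILED"]
    lost_hosts = sorted({c.host for c in components
                         if c.host_state == "HEARTBEAT_LOST"})
    return failed_masters, lost_hosts


def should_issue_start(components):
    """Skip START when any master component is in INSTALL_FAILED."""
    failed_masters, _ = install_blockers(components)
    return not failed_masters


# Made-up two-host cluster resembling the scenario in the report.
SAMPLE = [
    HostComponent("host1", "ZOOKEEPER_SERVER", True, "INSTALLED", "HEALTHY"),
    HostComponent("host2", "HBASE_MASTER", True, "INSTALL_FAILED", "HEARTBEAT_LOST"),
]

if __name__ == "__main__":
    failed, lost = install_blockers(SAMPLE)
    if not should_issue_start(SAMPLE):
        print("Not starting services; failed masters:",
              [c.component for c in failed])
        print("Hosts in HEARTBEAT_LOST:", lost)
```

Under these assumptions the check looks at component state rather than at whether the created tasks succeeded, so a retry that creates no tasks for the down host would still be caught, and the list of lost hosts could be shown to the user.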

This message was sent by Atlassian JIRA
