ambari-dev mailing list archives

From "Jaimin D Jetly (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AMBARI-4530) Cluster install errors out strangely without starting services
Date Fri, 21 Feb 2014 21:09:19 GMT

     [ https://issues.apache.org/jira/browse/AMBARI-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jaimin D Jetly updated AMBARI-4530:
-----------------------------------

    Attachment:     (was: Screen Shot 2014-02-18 at 3.20.01 PM.png)

> Cluster install errors out strangely without starting services
> --------------------------------------------------------------
>
>                 Key: AMBARI-4530
>                 URL: https://issues.apache.org/jira/browse/AMBARI-4530
>             Project: Ambari
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.4.4
>            Reporter: Jaimin D Jetly
>            Assignee: Jaimin D Jetly
>             Fix For: 1.5.0
>
>         Attachments: AMBARI-4530.patch, AMBARI-4530_2.patch
>
>
> On a two-host cluster, one of the agents was down.
> The first INSTALL attempt fails because tasks for the down agent time out and are aborted.
> When INSTALL is retried, no tasks are created for that host (the agent is down, so the host is in the HEARTBEAT_LOST state).
> {noformat}
> 06:38:55,649  INFO [qtp593591875-22] AmbariManagementControllerImpl:1147 - Command is not created for servicecomponenthost , clusterName=c1, clusterId=2, serviceName=HBASE, componentName=HBASE_MASTER, hostname=c6401.ambari.apache.org, hostState=HEARTBEAT_LOST, targetNewState=INSTALLED
> {noformat}
> However, some tasks are created for the other agent, and those succeed. At this point the front end (FE) assumes that the install succeeded and issues a START all. That results in the state change errors we see in the log.
> _The FE assumption is based on the fact that all of the tasks that were created succeeded._
> {noformat}
> 06:40:04,488 ERROR [qtp593591875-19] AbstractResourceProvider:302 - Caught AmbariException when modifying a resource
> org.apache.ambari.server.AmbariException: Invalid transition for servicecomponenthost, clusterName=c1, clusterId=2, serviceName=ZOOKEEPER, componentName=ZOOKEEPER_SERVER, hostname=c6401.ambari.apache.org, currentState=INSTALL_FAILED, newDesiredState=STARTED
> {noformat}
> We should discuss possible solutions. One solution could be to have the FE not issue a START if any master components are in the INSTALL_FAILED state. In addition, showing which hosts are in the HEARTBEAT_LOST state would help the user debug the situation. Another option is to have the back end (BE) indicate that tasks were not created for some host(s). In any case, when a host is down, we need a way to get out of the install wizard.
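
The two options proposed above can be sketched roughly as follows. This is
only an illustration in Python, not Ambari code: the HostComponent record,
both function names, the "HEALTHY" host state and the second hostname
(c6402.ambari.apache.org) are assumptions made for the example. Only the
INSTALL_FAILED and HEARTBEAT_LOST state names and the c6401 hostname come
from the logs in the description.

```python
from dataclasses import dataclass

# State names as they appear in the log excerpts of the issue description.
INSTALL_FAILED = "INSTALL_FAILED"
HEARTBEAT_LOST = "HEARTBEAT_LOST"


@dataclass(frozen=True)
class HostComponent:
    """Hypothetical record for one component on one host (not an Ambari type)."""
    host: str
    host_state: str   # e.g. "HEARTBEAT_LOST"; "HEALTHY" is an assumed value
    component: str    # e.g. "ZOOKEEPER_SERVER"
    state: str        # e.g. "INSTALLED" or "INSTALL_FAILED"
    is_master: bool


def check_before_start(components):
    """FE-side option: decide whether START should be issued.

    Returns (ok_to_start, failed_masters, lost_hosts). START is blocked only
    when a master component is in INSTALL_FAILED; lost_hosts is returned so
    the wizard could display it to the user.
    """
    failed_masters = sorted(
        (c.component, c.host)
        for c in components
        if c.is_master and c.state == INSTALL_FAILED
    )
    lost_hosts = sorted({c.host for c in components if c.host_state == HEARTBEAT_LOST})
    return (not failed_masters), failed_masters, lost_hosts


def hosts_without_tasks(requested_hosts, task_hosts):
    """BE-side option: report hosts for which no tasks were created."""
    return sorted(set(requested_hosts) - set(task_hosts))


components = [
    HostComponent("c6401.ambari.apache.org", HEARTBEAT_LOST,
                  "ZOOKEEPER_SERVER", INSTALL_FAILED, True),
    HostComponent("c6402.ambari.apache.org", "HEALTHY",
                  "DATANODE", "INSTALLED", False),
]
ok_to_start, failed_masters, lost_hosts = check_before_start(components)
skipped = hosts_without_tasks(
    ["c6401.ambari.apache.org", "c6402.ambari.apache.org"],
    ["c6402.ambari.apache.org"],
)
```

With the sample data, the check refuses START because ZOOKEEPER_SERVER on
c6401 is in INSTALL_FAILED, and both helpers point at c6401 as the host that
needs attention. Whether the check belongs in the FE, the BE, or both is the
open question raised in the description.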



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
