ambari-dev mailing list archives

From "Jaimin D Jetly (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AMBARI-4530) Cluster install errors out strangely without starting services
Date Fri, 21 Feb 2014 03:11:19 GMT

     [ https://issues.apache.org/jira/browse/AMBARI-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jaimin D Jetly updated AMBARI-4530:
-----------------------------------

    Description: 
On a two-host cluster, one of the agents was down.

The first INSTALL attempt fails because tasks for the down agent time out and are aborted.

When INSTALL is retried, no tasks are created for that host (its agent is down, so the
host is in HEARTBEAT_LOST state).
{noformat}
06:38:55,649  INFO [qtp593591875-22] AmbariManagementControllerImpl:1147 - Command is not
created for servicecomponenthost , clusterName=c1, clusterId=2, serviceName=HBASE, componentName=HBASE_MASTER,
hostname=c6401.ambari.apache.org, hostState=HEARTBEAT_LOST, targetNewState=INSTALLED
{noformat}

However, some tasks are created for the other agent, and those succeed. At this point, the FE
assumes that the install succeeded and issues a START all. That results in the state change
errors we see in the log.
_The FE's assumption is based on the fact that all created tasks succeeded._

{noformat}
06:40:04,488 ERROR [qtp593591875-19] AbstractResourceProvider:302 - Caught AmbariException
when modifying a resource
org.apache.ambari.server.AmbariException: Invalid transition for servicecomponenthost, clusterName=c1,
clusterId=2, serviceName=ZOOKEEPER, componentName=ZOOKEEPER_SERVER, hostname=c6401.ambari.apache.org,
currentState=INSTALL_FAILED, newDesiredState=STARTED
{noformat}
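
As a rough illustration of why that check is misleading: when no tasks are created for a host, "all created tasks succeeded" still holds, so the gap only shows up if task coverage is compared against the expected hosts. This is a sketch only. The task dictionaries, the function names, the second hostname, and the roles listed for it are hypothetical and are not Ambari's actual FE code or API schema.

```python
# Hypothetical model of the tasks returned for an INSTALL request.
# Field names are illustrative, not Ambari's real API schema.

def all_created_tasks_succeeded(tasks):
    """The assumption described above: only looks at tasks that exist."""
    return all(t["status"] == "COMPLETED" for t in tasks)


def hosts_without_tasks(expected_hosts, tasks):
    """Hosts in the cluster for which the request created no task at all."""
    hosts_with_tasks = {t["host"] for t in tasks}
    return sorted(set(expected_hosts) - hosts_with_tasks)


# Retry scenario from this report: c6401 is in HEARTBEAT_LOST, so only the
# other host receives tasks. The second hostname and its roles are assumed.
expected_hosts = ["c6401.ambari.apache.org", "c6402.ambari.apache.org"]
retry_tasks = [
    {"host": "c6402.ambari.apache.org", "role": "ZOOKEEPER_SERVER", "status": "COMPLETED"},
    {"host": "c6402.ambari.apache.org", "role": "HBASE_REGIONSERVER", "status": "COMPLETED"},
]

print(all_created_tasks_succeeded(retry_tasks))          # True, although c6401 was skipped
print(hosts_without_tasks(expected_hosts, retry_tasks))  # ['c6401.ambari.apache.org']
```

A non-empty result from the second check would be a signal that the install cannot be treated as complete, even though every created task succeeded.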

We should discuss possible solutions. One solution is for the FE not to issue a START if
any master components are in INSTALL_FAILED state. In addition, showing that some hosts are
in HEARTBEAT_LOST state would help the user debug the situation. Another option is for the
BE to indicate that tasks were not created for some host(s). In any case, when a host is
down, we need a way to get out of the install wizard.
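
The first option, combined with surfacing HEARTBEAT_LOST hosts, could look roughly like the following. This is a minimal sketch under assumed data shapes: MASTER_COMPONENTS, the component and host dictionaries, and start_blockers are hypothetical names for illustration, not existing Ambari code.

```python
# Assumed set of master components, for illustration only.
MASTER_COMPONENTS = {"HBASE_MASTER", "ZOOKEEPER_SERVER", "NAMENODE"}


def start_blockers(host_components, host_states):
    """Return human-readable reasons why START should not be issued.

    host_components: list of dicts with "host", "component", "state".
    host_states: dict mapping hostname -> host state string.
    An empty result means this guard found nothing blocking START.
    """
    reasons = []
    for hc in host_components:
        if hc["component"] in MASTER_COMPONENTS and hc["state"] == "INSTALL_FAILED":
            reasons.append(f'{hc["component"]} on {hc["host"]} is in INSTALL_FAILED state')
    for host, state in sorted(host_states.items()):
        if state == "HEARTBEAT_LOST":
            reasons.append(f"host {host} is in HEARTBEAT_LOST state")
    return reasons


# States taken from the log excerpts in this report.
components = [
    {"host": "c6401.ambari.apache.org", "component": "ZOOKEEPER_SERVER", "state": "INSTALL_FAILED"},
]
states = {"c6401.ambari.apache.org": "HEARTBEAT_LOST"}

for reason in start_blockers(components, states):
    print(reason)
```

Returning the reasons instead of a plain boolean lets the wizard both skip the START and tell the user which hosts need attention.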

> Cluster install errors out strangely without starting services
> --------------------------------------------------------------
>
>                 Key: AMBARI-4530
>                 URL: https://issues.apache.org/jira/browse/AMBARI-4530
>             Project: Ambari
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.4.4
>            Reporter: Jaimin D Jetly
>            Assignee: Jaimin D Jetly
>             Fix For: 1.5.0
>
>         Attachments: AMBARI-4530.patch, AMBARI-4530_2.patch, Screen Shot 2014-02-18 at 3.20.01 PM.png, Screen Shot 2014-02-18 at 3.31.01 PM.png, Screen Shot 2014-02-18 at 3.31.16 PM.png
>
>



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
