ambari-dev mailing list archives

From "Jaimin D Jetly (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AMBARI-4530) Cluster install errors out strangely without starting services
Date Fri, 21 Feb 2014 22:43:19 GMT

    [ https://issues.apache.org/jira/browse/AMBARI-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908940#comment-13908940 ]

Jaimin D Jetly commented on AMBARI-4530:
----------------------------------------

*UI Design:*

As a solution, after the install task completes successfully, the FE queries the state of
every host successfully registered in the cluster. If any host is in the "HEARTBEAT_LOST" state,
the FE declares the cluster to be in "INSTALL FAILED" (note: the FE stores this cluster object
in browser localStorage). This makes the UI behave the same way as when an install request
task is reported as failed. Importantly, all links to the previous steps will be enabled and
the Next button on the page will be disabled, so the user cannot complete the installer wizard
but can, if desired, go back and remove a host from the cluster.
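
The check described above can be sketched roughly as follows. This is only an illustration
under stated assumptions, not the code in the attached patches: the HostInfo shape, the
function names, and the "INSTALLED" success status are hypothetical; only "HEARTBEAT_LOST"
and "INSTALL FAILED" come from this comment.

```typescript
// Hypothetical shape of a host as seen by the FE (not Ambari's actual model).
interface HostInfo {
  hostName: string;
  hostState: string; // e.g. "HEARTBEAT_LOST", as reported by the server
}

// "INSTALL FAILED" is the status named in this comment; "INSTALLED" is an assumed success value.
type ClusterStatus = "INSTALLED" | "INSTALL FAILED";

// Returns the hosts whose agent has stopped heartbeating.
function hostsWithLostHeartbeat(hosts: HostInfo[]): HostInfo[] {
  return hosts.filter((h) => h.hostState === "HEARTBEAT_LOST");
}

// Decides the cluster status after the install request itself reported success.
function clusterStatusAfterInstall(hosts: HostInfo[]): ClusterStatus {
  return hostsWithLostHeartbeat(hosts).length > 0 ? "INSTALL FAILED" : "INSTALLED";
}

// The Next button is disabled whenever the cluster is considered failed.
function isNextEnabled(status: ClusterStatus): boolean {
  return status !== "INSTALL FAILED";
}
```

With the two-host cluster from the issue description, one host in "HEARTBEAT_LOST" would be
enough for clusterStatusAfterInstall to return "INSTALL FAILED" and for Next to stay disabled.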

In addition to what is shown when an install request fails, when a host is detected to be in
the "HEARTBEAT_LOST" state the UI will display the message {color:red}Heartbeat lost for the
host{color} next to that host. Clicking on the message will open a host pop-up that displays
the error message. Please see attached snapshot: Solution-3.png

An error message will also be shown at the bottom of the page: {color:red}Ambari agent is
not running on <detected number> hosts.{color} {color:blue}Show Details{color}. Please
see attached snapshot: Solution-1.png

Clicking on {color:blue}Show Details{color} opens a pop-up showing a map from each host to
all components on that host. Please see attached snapshot: Solution-2.png
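
As a rough illustration of the data behind the bottom-of-page message and the pop-up, the
sketch below builds the message text and a host-to-components map. The HostComponent shape
and both function names are hypothetical, and limiting the map to the hosts passed in is an
assumption; the message wording is taken from this comment.

```typescript
// Hypothetical record linking a component to the host it is assigned to.
interface HostComponent {
  hostName: string;
  componentName: string;
}

// Builds the error text shown at the bottom of the page.
function agentDownMessage(lostHostCount: number): string {
  return "Ambari agent is not running on " + lostHostCount + " hosts.";
}

// Maps each given host name to the names of all components assigned to it.
function componentsByHost(
  hostNames: string[],
  hostComponents: HostComponent[]
): Record<string, string[]> {
  const result: Record<string, string[]> = {};
  for (const name of hostNames) {
    result[name] = [];
  }
  for (const hc of hostComponents) {
    const list = result[hc.hostName];
    if (list !== undefined) {
      list.push(hc.componentName); // components on hosts not listed are ignored
    }
  }
  return result;
}
```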

*Assumption:* 

If no host is in the "HEARTBEAT_LOST" state when the install services request completes
successfully, no hostComponent will be in the UNKNOWN or INSTALL_FAILED state.
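
If this assumption ever needs to be verified defensively, a check along these lines could
flag violations. It is a sketch only: the HostComponentState shape and the function name
are hypothetical, and nothing in this comment says the FE performs such a check.

```typescript
// Hypothetical record of a component's state on a host.
interface HostComponentState {
  hostName: string;
  componentName: string;
  state: string; // e.g. "INSTALLED", "UNKNOWN", "INSTALL_FAILED"
}

// Returns the hostComponents that contradict the assumption. The assumption only
// applies when no host is in HEARTBEAT_LOST, so nothing is reported otherwise.
function assumptionViolations(
  hostStates: string[],
  hostComponents: HostComponentState[]
): HostComponentState[] {
  const anyHeartbeatLost = hostStates.some((s) => s === "HEARTBEAT_LOST");
  if (anyHeartbeatLost) {
    return [];
  }
  return hostComponents.filter(
    (hc) => hc.state === "UNKNOWN" || hc.state === "INSTALL_FAILED"
  );
}
```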




> Cluster install errors out strangely without starting services
> --------------------------------------------------------------
>
>                 Key: AMBARI-4530
>                 URL: https://issues.apache.org/jira/browse/AMBARI-4530
>             Project: Ambari
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.4.4
>            Reporter: Jaimin D Jetly
>            Assignee: Jaimin D Jetly
>             Fix For: 1.5.0
>
>         Attachments: AMBARI-4530.patch, AMBARI-4530_2.patch, Solution-1.png, Solution-2.png, Solution-3.png
>
>
> On a two-host cluster, one of the agents was down.
> The first INSTALL attempt fails as tasks for the down agent time out and get aborted.
> When INSTALL is retried, no tasks are created for one host (as the agent is down and
> thus the host is in HEARTBEAT_LOST state).
> {noformat}
> 06:38:55,649  INFO [qtp593591875-22] AmbariManagementControllerImpl:1147 - Command is not created for servicecomponenthost , clusterName=c1, clusterId=2, serviceName=HBASE, componentName=HBASE_MASTER, hostname=c6401.ambari.apache.org, hostState=HEARTBEAT_LOST, targetNewState=INSTALLED
> {noformat}
> However, some tasks get created for the other agent and those succeed. At this point,
> the FE assumes that the install succeeded and issues a START all. That results in the
> state change errors we see in the log.
> _The FE's assumption is based on the fact that all created tasks succeeded._
> {noformat}
> 06:40:04,488 ERROR [qtp593591875-19] AbstractResourceProvider:302 - Caught AmbariException when modifying a resource
> org.apache.ambari.server.AmbariException: Invalid transition for servicecomponenthost, clusterName=c1, clusterId=2, serviceName=ZOOKEEPER, componentName=ZOOKEEPER_SERVER, hostname=c6401.ambari.apache.org, currentState=INSTALL_FAILED, newDesiredState=STARTED
> {noformat}
> We should discuss possible solutions. One solution could be to have the FE not issue a START
> if any master components are in INSTALL_FAILED state. In addition, showing that some hosts
> are in HEARTBEAT_LOST state can help the user debug the situation. Another option is to have
> the BE somehow indicate that tasks did not get created for host(s). In any case, when a host
> is down, we need a way to get out of the install wizard.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
