Date: Sat, 22 Feb 2014 03:36:19 +0000 (UTC)
From: "Jaimin D Jetly (JIRA)"
To: dev@ambari.apache.org
Subject: [jira] [Commented] (AMBARI-4530) Cluster install errors out strangely without starting services

    [ https://issues.apache.org/jira/browse/AMBARI-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13909206#comment-13909206 ]

Jaimin D Jetly commented on AMBARI-4530:
----------------------------------------

Patch committed to trunk.

> Cluster install errors out strangely without starting services
> --------------------------------------------------------------
>
>              Key: AMBARI-4530
>              URL: https://issues.apache.org/jira/browse/AMBARI-4530
>          Project: Ambari
>       Issue Type: Bug
>       Components: client
> Affects Versions: 1.4.4
>         Reporter: Jaimin D Jetly
>         Assignee: Jaimin D Jetly
>          Fix For: 1.5.0
>
>      Attachments: AMBARI-4530.patch, AMBARI-4530_2.patch, Solution-1.png, Solution-2.png, Solution-3.png
>
>
> On a two-host cluster, one of the agents was down.
> The first INSTALL attempt fails because tasks for the down agent time out and get aborted.
> When INSTALL is retried, no tasks are created for that host (the agent is down, so the host is in HEARTBEAT_LOST state).
> {noformat}
> 06:38:55,649 INFO [qtp593591875-22] AmbariManagementControllerImpl:1147 - Command is not created for servicecomponenthost , clusterName=c1, clusterId=2, serviceName=HBASE, componentName=HBASE_MASTER, hostname=c6401.ambari.apache.org, hostState=HEARTBEAT_LOST, targetNewState=INSTALLED
> {noformat}
> However, some tasks do get created for the other agent, and those succeed. At this point, FE assumes that the install succeeded and issues a START all. That results in the state change errors we see in the log.
> _FE's assumption is based on the fact that all created tasks succeeded._
> {noformat}
> 06:40:04,488 ERROR [qtp593591875-19] AbstractResourceProvider:302 - Caught AmbariException when modifying a resource
> org.apache.ambari.server.AmbariException: Invalid transition for servicecomponenthost, clusterName=c1, clusterId=2, serviceName=ZOOKEEPER, componentName=ZOOKEEPER_SERVER, hostname=c6401.ambari.apache.org, currentState=INSTALL_FAILED, newDesiredState=STARTED
> {noformat}
> We should discuss possible solutions. One solution could be to have FE not issue a START if there are master components in INSTALL_FAILED state. In addition, showing that some hosts are in HEARTBEAT_LOST state would help the user debug the situation. Another option is to have BE somehow indicate that tasks did not get created for host(s). In any case, when a host is down, we need a way to get out of the install wizard.
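
The first proposed solution (do not issue START while master components are in INSTALL_FAILED state, and surface hosts in HEARTBEAT_LOST state) could be sketched roughly as below. This is only an illustration of the check, not the committed patch: the `HostComponent` and `StartDecision` types, the `check_before_start` function, and the `is_master` flag are hypothetical names introduced here, and the state strings are taken from the log excerpts in the description.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class HostComponent:
    """Hypothetical view of one component on one host (not an Ambari type)."""
    component_name: str
    hostname: str
    state: str        # component state, e.g. "INSTALLED" or "INSTALL_FAILED"
    host_state: str   # host state, e.g. "HEARTBEAT_LOST"
    is_master: bool   # assumption: the caller knows which components are masters


@dataclass(frozen=True)
class StartDecision:
    can_start: bool
    failed_masters: List[str]        # master components stuck in INSTALL_FAILED
    heartbeat_lost_hosts: List[str]  # hosts to show to the user for debugging


def check_before_start(components: Iterable[HostComponent]) -> StartDecision:
    """Decide whether START should be issued after an INSTALL attempt.

    Instead of relying on "all created tasks succeeded", inspect the
    resulting states: block START if any master is in INSTALL_FAILED,
    and collect hosts whose heartbeat was lost.
    """
    components = list(components)
    failed_masters = sorted(
        f"{c.component_name}@{c.hostname}"
        for c in components
        if c.is_master and c.state == "INSTALL_FAILED"
    )
    lost_hosts = sorted(
        {c.hostname for c in components if c.host_state == "HEARTBEAT_LOST"}
    )
    return StartDecision(
        can_start=not failed_masters,
        failed_masters=failed_masters,
        heartbeat_lost_hosts=lost_hosts,
    )


if __name__ == "__main__":
    # ZOOKEEPER_SERVER on c6401 mirrors the error log above; treating it
    # as a master is an assumption made for this example.
    decision = check_before_start([
        HostComponent("ZOOKEEPER_SERVER", "c6401.ambari.apache.org",
                      "INSTALL_FAILED", "HEARTBEAT_LOST", True),
    ])
    print(decision)
```

With a check of this kind, the retry scenario in the description would stop before START and could report c6401.ambari.apache.org as a host in HEARTBEAT_LOST state, rather than failing later with an invalid state transition.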