Date: Tue, 10 Jan 2017 17:09:59 +0000 (UTC)
From: "Jonathan Hurley (JIRA)"
To: issues@ambari.apache.org
Reply-To: dev@ambari.apache.org
Subject: [jira] [Updated] (AMBARI-19435) NodeManager restart fails during HOU if it is on same host as RM

     [ https://issues.apache.org/jira/browse/AMBARI-19435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hurley updated AMBARI-19435:
-------------------------------------
    Attachment: AMBARI-19435.patch

> NodeManager restart fails during HOU if it is on same host as RM
> ----------------------------------------------------------------
>
>                 Key: AMBARI-19435
>                 URL: https://issues.apache.org/jira/browse/AMBARI-19435
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.5.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Critical
>             Fix For: 2.5.0
>
>         Attachments: AMBARI-19435.patch
>
>
> *Steps*
> # Deploy an HDP-2.5.0.0 cluster with Ambari-2.5.0.0: a 4-node cluster with NodeManager installed on all hosts, NN HA enabled, and RM HA not enabled
> # Register version 2.5.3.0 and install the bits
> # Start HOU using the API and accept the manual prompts to sys-prep the hosts. Observe the wizard at the restart task for the host that runs RM and NM together
> *Result:*
> At the task to restart NodeManager on the RM host, the following failure was observed:
> {code}
> 2016-12-20 18:32:39,446 - File['/var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] {'action': ['delete'], 'not_if': 'ambari-sudo.sh -H -E test -f /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh -H -E pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'}
> 2016-12-20 18:32:39,459 - Execute['ulimit -c unlimited; export HADOOP_LIBEXEC_DIR=/usr/hdp/2.5.3.0-37/hadoop/libexec && /usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh --config /usr/hdp/2.5.3.0-37/hadoop/conf start nodemanager'] {'not_if': 'ambari-sudo.sh -H -E test -f /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh -H -E pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid', 'user': 'yarn'}
> 2016-12-20 18:32:40,558 - Execute['ambari-sudo.sh -H -E test -f /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh -H -E pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] {'not_if': 'ambari-sudo.sh -H -E test -f /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh -H -E pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid', 'tries': 5, 'try_sleep': 1}
> 2016-12-20 18:32:40,576 - Skipping Execute['ambari-sudo.sh -H -E test -f /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh -H -E pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] due to not_if
> 2016-12-20 18:32:40,576 - Executing NodeManager Stack Upgrade post-restart
> 2016-12-20 18:32:40,578 - NodeManager executing "yarn node -list -states=RUNNING" to verify the node has rejoined the cluster...
> 2016-12-20 18:32:40,578 - checked_call['yarn node -list -states=RUNNING'] {'user': 'yarn'}
> Command failed after 1 tries
> {code}
> Retrying the failed task succeeds.
> The issue appears to be that RM is still down when we try to start NM on the host. While starting NM, we run the following command to verify that NM has come up:
> {code}
> yarn node -list -states=RUNNING
> {code}
> The command fails because it tries to connect to RM, which results in a timeout.
> As a possible fix, we may need to adjust the order in the HOU upgrade pack so that RM starts before NM in such cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
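
The ordering idea in the possible fix for AMBARI-19435 (start RM before NM when both run on the same host) can be illustrated with a small Python sketch. This is not the attached patch and not Ambari's upgrade-pack mechanism; the function name order_restarts, the start_before pair list, and the component names used here are hypothetical and chosen only for illustration.

```python
def order_restarts(components, start_before):
    """Reorder components so that, for every (first, second) pair in
    start_before, `first` is restarted ahead of `second`.

    Hypothetical helper for illustration only. Components keep their
    original relative order wherever no constraint applies. Component
    names are assumed to be unique. Raises ValueError on a cycle.
    """
    present = set(components)
    # Ignore constraints that mention components not on this host.
    edges = [(a, b) for a, b in start_before if a in present and b in present]

    blockers = {c: 0 for c in components}      # unmet prerequisites per component
    dependents = {c: [] for c in components}   # components waiting on each one
    for first, second in edges:
        blockers[second] += 1
        dependents[first].append(second)

    ordered = []
    remaining = list(components)
    while remaining:
        # Pick the earliest-listed component with no unmet prerequisite.
        ready = next((c for c in remaining if blockers[c] == 0), None)
        if ready is None:
            raise ValueError("cyclic restart constraints: %r" % remaining)
        remaining.remove(ready)
        ordered.append(ready)
        for waiting in dependents[ready]:
            blockers[waiting] -= 1
    return ordered


# Assumed example: NM is listed first on the host, but RM must start before it.
host_components = ["NODEMANAGER", "RESOURCEMANAGER", "DATANODE"]
constraints = [("RESOURCEMANAGER", "NODEMANAGER")]
print(order_restarts(host_components, constraints))
# ['RESOURCEMANAGER', 'NODEMANAGER', 'DATANODE']
```

With RM restarted first, the post-restart check "yarn node -list -states=RUNNING" would have an RM to connect to. Whether the actual fix reorders the upgrade pack this way is not stated in this message.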