ambari-issues mailing list archives

From "Jonathan Hurley (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AMBARI-19435) NodeManager restart fails during HOU if it is on same host as RM
Date Tue, 10 Jan 2017 01:45:58 GMT
Jonathan Hurley created AMBARI-19435:
----------------------------------------

             Summary: NodeManager restart fails during HOU if it is on same host as RM
                 Key: AMBARI-19435
                 URL: https://issues.apache.org/jira/browse/AMBARI-19435
             Project: Ambari
          Issue Type: Bug
          Components: ambari-server
    Affects Versions: 2.5.0
            Reporter: Jonathan Hurley
            Assignee: Jonathan Hurley
            Priority: Critical
             Fix For: 2.5.0


*Steps:*
# Deploy an HDP-2.5.0.0 cluster with Ambari-2.5.0.0: a 4-node cluster with NodeManager installed
on all hosts, NN HA enabled, and RM HA not enabled
# Register version 2.5.3.0 and install the bits
# Start a Host Ordered Upgrade (HOU) using the API and accept the manual prompts to sys-prep the hosts.
Observe the wizard at the restart task for the host that runs RM and NM together

*Result:*
At the task to restart NodeManager on the RM host, the following failure was observed:
{code}
2016-12-20 18:32:39,446 - File['/var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] {'action':
['delete'], 'not_if': 'ambari-sudo.sh  -H -E test -f /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid
&& ambari-sudo.sh  -H -E pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'}
2016-12-20 18:32:39,459 - Execute['ulimit -c unlimited; export HADOOP_LIBEXEC_DIR=/usr/hdp/2.5.3.0-37/hadoop/libexec
&& /usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh --config /usr/hdp/2.5.3.0-37/hadoop/conf
start nodemanager'] {'not_if': 'ambari-sudo.sh  -H -E test -f /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid
&& ambari-sudo.sh  -H -E pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid',
'user': 'yarn'}
2016-12-20 18:32:40,558 - Execute['ambari-sudo.sh  -H -E test -f /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid
&& ambari-sudo.sh  -H -E pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid']
{'not_if': 'ambari-sudo.sh  -H -E test -f /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid
&& ambari-sudo.sh  -H -E pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid',
'tries': 5, 'try_sleep': 1}
2016-12-20 18:32:40,576 - Skipping Execute['ambari-sudo.sh  -H -E test -f /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid
&& ambari-sudo.sh  -H -E pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid']
due to not_if
2016-12-20 18:32:40,576 - Executing NodeManager Stack Upgrade post-restart
2016-12-20 18:32:40,578 - NodeManager executing "yarn node -list -states=RUNNING" to verify
the node has rejoined the cluster...
2016-12-20 18:32:40,578 - checked_call['yarn node -list -states=RUNNING'] {'user': 'yarn'}

Command failed after 1 tries
{code}

A retry of the failed task succeeds.

The failure appears to occur because RM is still down when we try to start NM on the host.
While starting NM, we run the command below to verify that NM has come up:
{code}
yarn node -list -states=RUNNING
{code}

The command fails because it tries to connect to RM, which results in a timeout.
As a possible fix, we may need to adjust the order in the HOU upgrade pack so that RM starts before
NM in such cases.
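
For illustration only: since a manual retry of the task succeeds, a client-side alternative would be to poll the verification command several times instead of failing after one try. The Python sketch below shows that idea. It is not the upgrade-pack reordering suggested in this report and it is not Ambari's actual code. The function name wait_for_node_running, its parameters, and the assumption that each node row of the command output begins with a host:port Node-Id are all hypothetical.

```python
import subprocess
import time


def _yarn_node_list():
    # Runs the same verification command quoted in this report.
    # Assumes the `yarn` CLI is on PATH; the 60 s timeout is an arbitrary choice.
    return subprocess.check_output(
        ["yarn", "node", "-list", "-states=RUNNING"],
        universal_newlines=True,
        timeout=60,
    )


def wait_for_node_running(hostname, run_command=_yarn_node_list,
                          tries=5, try_sleep=1.0, sleep=time.sleep):
    """Return True once `hostname` is listed as a RUNNING node, else False.

    Hypothetical helper: a failed or timed-out call (for example while RM
    is still down) counts as "not yet" rather than as a fatal error.
    """
    for attempt in range(1, tries + 1):
        try:
            output = run_command()
        except (subprocess.CalledProcessError,
                subprocess.TimeoutExpired, OSError):
            output = ""
        for line in output.splitlines():
            fields = line.split()
            # Assumption: a node row starts with a Node-Id of the form host:port.
            if fields and fields[0].split(":")[0] == hostname:
                return True
        if attempt < tries:
            sleep(try_sleep)
    return False
```

Polling only hides the ordering problem while RM is expected to come up shortly afterwards; if RM is started later in the upgrade sequence, the check would still time out, so starting RM before NM remains the more direct fix.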



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
