ambari-issues mailing list archives

From "Dmitry Lysnichenko (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AMBARI-18786) HDP Upgrade fails when the cluster size is large
Date Thu, 03 Nov 2016 14:47:58 GMT
Dmitry Lysnichenko created AMBARI-18786:
-------------------------------------------

             Summary: HDP Upgrade fails when the cluster size is large
                 Key: AMBARI-18786
                 URL: https://issues.apache.org/jira/browse/AMBARI-18786
             Project: Ambari
          Issue Type: Bug
            Reporter: Dmitry Lysnichenko
            Assignee: Dmitry Lysnichenko
         Attachments: AMBARI-18786.patch


Starting with Ambari 2.4, the HDP upgrade fails during the NameNode restart on large clusters.

This happens because the restart command waits for the NameNode to leave safemode. On a large
cluster the NameNode takes longer to leave safemode, and Ambari marks the action as failed
when the NameNode has not left safemode within the timeout configured in the Ambari scripts.


{code}

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py", line 42, in get_value_from_jmx
    return data_dict["beans"][0][property]
IndexError: list index out of range
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 420, in <module>
    NameNode().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 280, in execute
    method(env)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 720, in restart
    self.start(env, upgrade_type=upgrade_type)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 101, in start
    upgrade_suspended=params.upgrade_suspended, env=env)
  File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
    return fn(*args, **kwargs)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 184, in namenode
    if is_this_namenode_active() is False:
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/decorator.py", line 55, in wrapper
    return function(*args, **kwargs)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 554, in is_this_namenode_active
    raise Fail(format("The NameNode {namenode_id} is not listed as Active or Standby, waiting..."))
resource_management.core.exceptions.Fail: The NameNode nn1 is not listed as Active or Standby, waiting...
{code}
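
The first traceback ends in an IndexError at data_dict["beans"][0][property], which is consistent with a JMX response whose "beans" list is empty while the NameNode is still starting. A minimal sketch of that failure mode follows. The helper name, the "State" key, and the sample payloads are illustrative assumptions, not the Ambari implementation.

```python
def get_value_from_beans(data_dict, prop):
    """Return `prop` from the first bean, or None when no bean is present.

    Hypothetical helper for illustration only; it is not the code in
    resource_management/libraries/functions/jmx.py.
    """
    beans = data_dict.get("beans", [])
    if not beans:
        # An empty list is what makes beans[0] raise IndexError.
        return None
    return beans[0].get(prop)


# Made-up sample payloads: one with no beans yet, one with a reported state.
empty_response = {"beans": []}
ready_response = {"beans": [{"State": "active"}]}

if __name__ == "__main__":
    print(get_value_from_beans(empty_response, "State"))
    print(get_value_from_beans(ready_response, "State"))
```

A caller that gets None can treat the state as unknown and retry, which is why the length of the retry window matters here.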

To resolve this, we increased the timeout in Ambari:

1. Increased the retry settings in /var/lib/ambari-server/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py
from this:
@retry(times=5, sleep_time=5, backoff_factor=2, err_class=Fail)
to this:
@retry(times=25, sleep_time=25, backoff_factor=2, err_class=Fail)

2. Restarted the Ambari server.

After this, the upgrade went through fine.

I think it's better to increase the timeout permanently so that we don't have to deal with
this issue again.
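
To make the size of the change concrete, here is a rough sketch that estimates the worst-case wait for a given set of retry parameters. It assumes a simple model that is not taken from the Ambari source: the call is attempted `times` times, the decorator sleeps after every failed attempt except the last, and the sleep is multiplied by `backoff_factor` after each failure. The optional `max_sleep_time` cap is also an assumption, included because an uncapped factor of 2 over 25 attempts would grow far beyond any practical wait.

```python
def worst_case_sleep(times, sleep_time, backoff_factor, max_sleep_time=None):
    """Total seconds slept if every attempt fails, under the assumed model.

    The sleep grows by `backoff_factor` after each failure. When
    `max_sleep_time` is given, the sleep only grows while the grown value
    stays at or below that cap (an assumed behavior, labeled as such).
    """
    total = 0
    current = sleep_time
    for _ in range(times - 1):  # no sleep after the final attempt
        total += current
        grown = current * backoff_factor
        if max_sleep_time is None or grown <= max_sleep_time:
            current = grown
    return total


if __name__ == "__main__":
    settings = [("original", (5, 5, 2)), ("increased", (25, 25, 2))]
    for label, args in settings:
        print(label, "uncapped:", worst_case_sleep(*args), "s")
        # The 8-second cap is a hypothetical value chosen for illustration.
        print(label, "capped at 8 s:", worst_case_sleep(*args, max_sleep_time=8), "s")
```

Under this model the original settings give at most 75 seconds of waiting without a cap, or 20 seconds if the sleep never grows past its starting value, while the increased settings give 600 seconds in the capped case. The actual figures depend on how the decorator in decorator.py handles backoff.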





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
