ambari-dev mailing list archives

From "Hari Sekhon (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AMBARI-10518) Ambari 2.0 stack upgrade HDP 2.2.0.0 => 2.2.4.0 breaks on safe mode check due to not kinit'd hdfs krb cache properly
Date Thu, 16 Apr 2015 10:49:58 GMT
Hari Sekhon created AMBARI-10518:
------------------------------------

             Summary: Ambari 2.0 stack upgrade HDP 2.2.0.0 => 2.2.4.0 breaks on safe mode check due to not kinit'd hdfs krb cache properly
                 Key: AMBARI-10518
                 URL: https://issues.apache.org/jira/browse/AMBARI-10518
             Project: Ambari
          Issue Type: Bug
          Components: ambari-server, stacks
    Affects Versions: 2.0.0
         Environment: HDP 2.2.0.0 => 2.2.4.0
            Reporter: Hari Sekhon


After deploying the new HDP 2.2.4.0 stack to all nodes successfully in Ambari 2.0, the "perform
upgrade" procedure fails on the first step:
{code}Fail: 2015-04-16 11:36:32,623 - Performing a(n) upgrade of HDFS
2015-04-16 11:36:32,624 - u"Execute['/usr/bin/kinit -kt /etc/security/keytabs/hdfs.headless.keytab
hdfs']" {}
2015-04-16 11:36:32,811 - Prepare to transition into safemode state OFF
2015-04-16 11:36:32,812 - call['su - hdfs -c 'hdfs dfsadmin -safemode get''] {}
2015-04-16 11:36:36,481 - Command: su - hdfs -c 'hdfs dfsadmin -safemode get'
Code: 255.
2015-04-16 11:36:36,481 - Error while executing command 'prepare_rolling_upgrade':
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
line 214, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py",
line 67, in prepare_rolling_upgrade
    namenode_upgrade.prepare_rolling_upgrade()
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode_upgrade.py",
line 100, in prepare_rolling_upgrade
    raise Fail("Could not transition to safemode state %s. Please check logs to make sure
namenode is up." % str(SafeMode.OFF))
Fail: Could not transition to safemode state OFF. Please check logs to make sure namenode
is up.
2015-04-16 11:36:36,481 - Error while executing command 'prepare_rolling_upgrade':
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
line 214, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py",
line 67, in prepare_rolling_upgrade
    namenode_upgrade.prepare_rolling_upgrade()
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode_upgrade.py",
line 100, in prepare_rolling_upgrade
    raise Fail("Could not transition to safemode state %s. Please check logs to make sure
namenode is up." % str(SafeMode.OFF))
Fail: Could not transition to safemode state OFF. Please check logs to make sure namenode
is up.{code}
It looks like this is because the Kerberos ticket cache was not properly initialized; I can see an old, expired cache for the hdfs user:
{code}
# su - hdfs -c 'hdfs dfsadmin -safemode get'
15/04/16 11:42:23 WARN ipc.Client: Exception encountered while connecting to the server :
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials
provided (Mechanism level: Failed to find any Kerberos tgt)]
safemode: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level:
Failed to find any Kerberos tgt)]; Host Details : local host is: "<host>/<ip>";
destination host is: "<host>":8020;
# echo $?
255
# su - hdfs
[hdfs@<host> ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_1008
Default principal: hdfs@LOCALDOMAIN

Valid starting     Expires            Service principal
04/13/15 16:10:59  04/14/15 16:10:59  krbtgt/LOCALDOMAIN@LOCALDOMAIN
        renew until 04/20/15 16:10:59
[hdfs@<host> ~]$ /usr/bin/kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs
[hdfs@<host> ~]$ logout
# su - hdfs -c 'hdfs dfsadmin -safemode get'
Safe mode is OFF in <nn1>/<ip1>:8020
Safe mode is OFF in <nn2>/<ip2>:8020
{code}
It looks like the Kerberos cache was initialized for root instead of the hdfs user, since the kinit command was not wrapped in {{su - hdfs}}.
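To illustrate the mechanism (a minimal sketch, not Ambari code): the default Kerberos credential cache is per-user, {{FILE:/tmp/krb5cc_<uid>}}, so a kinit run as root populates root's cache and leaves the hdfs user's cache (uid 1008 in the klist output above) untouched:

```shell
# Default Kerberos ticket cache path is derived from the invoking user's uid,
# so kinit run as root (uid 0) cannot refresh the hdfs user's cache.
default_ccache() { echo "/tmp/krb5cc_$1"; }

echo "root cache: $(default_ccache 0)"
echo "hdfs cache: $(default_ccache 1008)"   # uid 1008, per the klist output above
```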

I retried once and got the same result (captured above for this JIRA), but after logging in as hdfs, manually kinit'ing the hdfs user's Kerberos cache, and retrying in Ambari, the step succeeded, so that is the workaround for now.
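The manual workaround can be sketched as below. The keytab path and principal are the ones shown in the log above, but verify them for your cluster; this sketch only builds and prints the corrected commands rather than executing them:

```shell
# Workaround sketch: run kinit wrapped in `su - hdfs` so the ticket lands
# in the hdfs user's cache, then re-check safemode as the same user.
# Keytab path and principal are taken from the failing Execute in the log.
KEYTAB=/etc/security/keytabs/hdfs.headless.keytab
PRINCIPAL=hdfs

KINIT_CMD="su - hdfs -c '/usr/bin/kinit -kt $KEYTAB $PRINCIPAL'"
CHECK_CMD="su - hdfs -c 'hdfs dfsadmin -safemode get'"

echo "$KINIT_CMD"
echo "$CHECK_CMD"
```

After running the kinit command, retrying the upgrade step in Ambari succeeded, as described above.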

Hari Sekhon
http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
