hadoop-common-dev mailing list archives

From "Duo Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-13433) Race in UGI.reloginFromKeytab
Date Wed, 27 Jul 2016 10:39:20 GMT
Duo Zhang created HADOOP-13433:
----------------------------------

             Summary: Race in UGI.reloginFromKeytab
                 Key: HADOOP-13433
                 URL: https://issues.apache.org/jira/browse/HADOOP-13433
             Project: Hadoop Common
          Issue Type: Bug
          Components: security
            Reporter: Duo Zhang


This problem has troubled us for several years. In our HBase cluster, a regionserver (RS)
occasionally gets stuck with the following error:

{noformat}
2016-06-20,03:44:12,936 INFO org.apache.hadoop.ipc.SecureClient: Exception encountered while connecting to the server :
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: The ticket isn't for us (35) - BAD TGS SERVER NAME)]
        at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:194)
        at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:140)
        at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupSaslConnection(SecureClient.java:187)
        at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.access$700(SecureClient.java:95)
        at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:325)
        at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:322)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1781)
        at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.util.Methods.call(Methods.java:37)
        at org.apache.hadoop.hbase.security.User.call(User.java:607)
        at org.apache.hadoop.hbase.security.User.access$700(User.java:51)
        at org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:461)
        at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupIOstreams(SecureClient.java:321)
        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1164)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1004)
        at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Invoker.invoke(SecureRpcEngine.java:107)
        at $Proxy24.replicateLogEntries(Unknown Source)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:962)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.runLoop(ReplicationSource.java:466)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:515)
Caused by: GSSException: No valid credentials provided (Mechanism level: The ticket isn't for us (35) - BAD TGS SERVER NAME)
        at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:663)
        at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:248)
        at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:180)
        at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:175)
        ... 23 more
Caused by: KrbException: The ticket isn't for us (35) - BAD TGS SERVER NAME
        at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:64)
        at sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:185)
        at sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:294)
        at sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:106)
        at sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:557)
        at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:594)
        ... 26 more
Caused by: KrbException: Identifier doesn't match expected value (906)
        at sun.security.krb5.internal.KDCRep.init(KDCRep.java:133)
        at sun.security.krb5.internal.TGSRep.init(TGSRep.java:58)
        at sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:53)
        at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:46)
        ... 31 more
{noformat}

This happens rarely, but when it does, the regionserver is stuck and never recovers.

Recently we added logging after each successful re-login to print the private credentials,
and finally caught the direct cause. After a successful re-login, the credentials contain two
Kerberos tickets: one is the TGT and the other is a service ticket. The strange thing is
that the service ticket is placed before the TGT. This breaks an assumption in the JDK's
Kerberos library. See the {{getTgt}} method in
http://hg.openjdk.java.net/jdk8u/jdk8u60/jdk/file/935758609767/src/share/classes/sun/security/jgss/krb5/Krb5InitCredential.java:

{code:title=Krb5InitCredential}
            return AccessController.doPrivileged(
                new PrivilegedExceptionAction<KerberosTicket>() {
                public KerberosTicket run() throws Exception {
                    // It's OK to use null as serverPrincipal. TGT is almost
                    // the first ticket for a principal and we use list.
                    return Krb5Util.getTicket(
                        realCaller,
                        clientPrincipal, null, acc);
                        }});
{code}
So the library uses the service ticket as the TGT when it tries to acquire a new service
ticket, and the KDC rejects the request because the server name of this 'TGT' does not start
with 'krbtgt'. It can never recover because the re-login in UGI first checks whether there
is a valid TGT, and we do have one...
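
To make the ordering problem concrete, here is a minimal diagnostic sketch in the spirit of the log we added. It walks the private credentials of a {{Subject}} in iteration order and reports whether the first {{KerberosTicket}} is a TGT. The class name {{TicketOrderCheck}}, the helper {{dummyTicket}}, and the principal names are made up for illustration, and the tickets are fake (they would never be accepted by a KDC). It also assumes that the credential set iterates in insertion order, which is what our log output suggests.

```java
import java.util.Date;
import javax.security.auth.Subject;
import javax.security.auth.kerberos.KerberosPrincipal;
import javax.security.auth.kerberos.KerberosTicket;

public class TicketOrderCheck {

    /** Returns the first KerberosTicket in iteration order of the private credentials, or null. */
    public static KerberosTicket firstTicket(Subject subject) {
        // Iterate the raw credential set rather than the typed view from
        // getPrivateCredentials(Class), since only the raw set reflects the stored order.
        for (Object cred : subject.getPrivateCredentials()) {
            if (cred instanceof KerberosTicket) {
                return (KerberosTicket) cred;
            }
        }
        return null;
    }

    /** A TGT is a ticket whose server principal starts with "krbtgt/". */
    public static boolean isTgt(KerberosTicket ticket) {
        return ticket != null && ticket.getServer().getName().startsWith("krbtgt/");
    }

    /** Hypothetical helper: builds a fake ticket for the given server principal. */
    public static KerberosTicket dummyTicket(String server) {
        Date now = new Date();
        Date end = new Date(now.getTime() + 3_600_000L);
        return new KerberosTicket(
            new byte[] {0},                                   // fake ASN.1 encoding
            new KerberosPrincipal("hbase/rs1.example.com@EXAMPLE.COM"),
            new KerberosPrincipal(server),
            new byte[8], 1,                                   // fake session key and key type
            null,                                             // no flags set
            now, now, end, null, null);
    }

    public static void main(String[] args) {
        Subject subject = new Subject();
        // Reproduce the order seen in our log: service ticket first, then the TGT.
        subject.getPrivateCredentials().add(dummyTicket("hbase/rs2.example.com@EXAMPLE.COM"));
        subject.getPrivateCredentials().add(dummyTicket("krbtgt/EXAMPLE.COM@EXAMPLE.COM"));

        KerberosTicket first = firstTicket(subject);
        System.out.println("first ticket server: " + first.getServer().getName());
        System.out.println("first ticket is a TGT: " + isTgt(first));
    }
}
```

A check like this could be run after each re-login to flag a Subject whose first ticket is not a TGT; it only detects the bad state and does not fix it.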

This usually happens when a secure connection is being set up at the same time as the re-login,
and the end time of the service ticket indicates that it was acquired with the previous TGT.
Since UGI does not prevent doAs and re-login from running at the same time, we believe that
there is a race condition.

After reading the code, we found a possible race condition.

See the {{initSecContext}} method in
http://hg.openjdk.java.net/jdk8u/jdk8u60/jdk/file/935758609767/src/share/classes/sun/security/jgss/krb5/Krb5Context.java.
It first gets the TGT, then checks whether there is already a service ticket; if not, it
acquires a service ticket using the TGT and puts it into the credentials.

In Krb5LoginModule.logout (the Sun version), the Kerberos tickets are first removed from the
credentials and then destroyed.

This is where the race comes in. Let T1 be the thread that sets up the secure connection and
T2 be the re-login thread.

T1: get TGT
T2: remove all tickets from credentials
T1: check for a service ticket, none (since all tickets have been removed)
T1: acquire a new service ticket using TGT and put it into the credentials
T2: destroy all tickets
T2: login, i.e., put a new TGT into the credentials.

As a result, the credentials contain the service ticket followed by the new TGT.
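
The interleaving above can be replayed deterministically as a small single-threaded sketch. This is only a model: a plain list stands in for the private credentials of the Subject, strings stand in for tickets, and the class name {{ReloginRaceSketch}} and the principal names are hypothetical. It does not call any JDK Kerberos code; it only shows why the six steps leave the service ticket ahead of the new TGT.

```java
import java.util.ArrayList;
import java.util.List;

public class ReloginRaceSketch {

    static final String OLD_TGT = "krbtgt/EXAMPLE.COM@EXAMPLE.COM (old)";
    static final String NEW_TGT = "krbtgt/EXAMPLE.COM@EXAMPLE.COM (new)";
    static final String SERVICE = "hbase/rs2.example.com@EXAMPLE.COM";

    /** Replays the six steps in order and returns the resulting credential list. */
    public static List<String> replay() {
        List<String> credentials = new ArrayList<>();
        credentials.add(OLD_TGT);

        // T1: get TGT (T1 now holds its own reference to the old TGT).
        String tgtSeenByT1 = credentials.get(0);

        // T2: remove all tickets from the credentials.
        List<String> removedByT2 = new ArrayList<>(credentials);
        credentials.clear();

        // T1: check for a service ticket; none, since everything was removed.
        boolean haveServiceTicket = credentials.contains(SERVICE);

        // T1: acquire a service ticket using the TGT it already holds and store it.
        if (!haveServiceTicket && tgtSeenByT1 != null) {
            credentials.add(SERVICE);
        }

        // T2: destroy the tickets it removed; the newly added service ticket is untouched.
        removedByT2.clear();

        // T2: login, which appends the new TGT after the leftover service ticket.
        credentials.add(NEW_TGT);
        return credentials;
    }

    public static void main(String[] args) {
        List<String> credentials = replay();
        System.out.println(credentials);
        System.out.println("first entry is a TGT: " + credentials.get(0).startsWith("krbtgt/"));
    }
}
```

In this model the first entry is the service ticket, which matches the order we saw in our log after the re-login.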

It is hard to write a unit test that reproduces the problem because the racing code is in the
JDK, which is not written by us...

Suggestions are welcome. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-dev-help@hadoop.apache.org

