hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13433) Race in UGI.reloginFromKeytab
Date Tue, 21 Feb 2017 18:28:45 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15876435#comment-15876435
] 

Wei-Chiu Chuang commented on HADOOP-13433:
------------------------------------------

[~daryn] would you mind to file a jira for your fix?
[~Apache9] found an issue with the test case that seems to stem from a JDK bug. Not sure if
that's what you observed. My fear is HADOOP-13433 might not be a complete solution to this
bug.

> Race in UGI.reloginFromKeytab
> -----------------------------
>
>                 Key: HADOOP-13433
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13433
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: security
>    Affects Versions: 2.8.0, 2.7.3, 2.6.5, 3.0.0-alpha1
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>             Fix For: 2.9.0, 2.7.4, 2.6.6, 2.8.1, 3.0.0-alpha3
>
>         Attachments: HADOOP-13433-branch-2.7.patch, HADOOP-13433-branch-2.7-v1.patch,
HADOOP-13433-branch-2.7-v2.patch, HADOOP-13433-branch-2.8.patch, HADOOP-13433-branch-2.8.patch,
HADOOP-13433-branch-2.8-v1.patch, HADOOP-13433-branch-2.patch, HADOOP-13433.patch, HADOOP-13433-v1.patch,
HADOOP-13433-v2.patch, HADOOP-13433-v4.patch, HADOOP-13433-v5.patch, HADOOP-13433-v6.patch,
HBASE-13433-testcase-v3.patch
>
>
> This is a problem that has troubled us for several years. For our HBase cluster, sometimes
the RS will be stuck due to
> {noformat}
> 2016-06-20,03:44:12,936 INFO org.apache.hadoop.ipc.SecureClient: Exception encountered
while connecting to the server :
> javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid
credentials provided (Mechanism level: The ticket isn't for us (35) - BAD TGS SERVER NAME)]
>         at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:194)
>         at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:140)
>         at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupSaslConnection(SecureClient.java:187)
>         at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.access$700(SecureClient.java:95)
>         at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:325)
>         at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:322)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1781)
>         at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.util.Methods.call(Methods.java:37)
>         at org.apache.hadoop.hbase.security.User.call(User.java:607)
>         at org.apache.hadoop.hbase.security.User.access$700(User.java:51)
>         at org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:461)
>         at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupIOstreams(SecureClient.java:321)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1164)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1004)
>         at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Invoker.invoke(SecureRpcEngine.java:107)
>         at $Proxy24.replicateLogEntries(Unknown Source)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:962)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.runLoop(ReplicationSource.java:466)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:515)
> Caused by: GSSException: No valid credentials provided (Mechanism level: The ticket isn't
for us (35) - BAD TGS SERVER NAME)
>         at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:663)
>         at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:248)
>         at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:180)
>         at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:175)
>         ... 23 more
> Caused by: KrbException: The ticket isn't for us (35) - BAD TGS SERVER NAME
>         at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:64)
>         at sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:185)
>         at sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:294)
>         at sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:106)
>         at sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:557)
>         at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:594)
>         ... 26 more
> Caused by: KrbException: Identifier doesn't match expected value (906)
>         at sun.security.krb5.internal.KDCRep.init(KDCRep.java:133)
>         at sun.security.krb5.internal.TGSRep.init(TGSRep.java:58)
>         at sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:53)
>         at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:46)
>         ... 31 more‚Äč
> {noformat}
> It rarely happens, but if it happens, the regionserver will be stuck and can never recover.
> Recently we added a log after a successful re-login which prints the private credentials,
and finally catched the direct reason. After a successful re-login, we have two kerberos tickets
in the credentials, one is the TGT, and the other is a service ticket. The strange thing is
that, the service ticket is placed before TGT. This breaks the assumption of jdk's kerberos
library. See http://hg.openjdk.java.net/jdk8u/jdk8u60/jdk/file/935758609767/src/share/classes/sun/security/jgss/krb5/Krb5InitCredential.java,
the {{getTgt}} Method
> {code:title=Krb5InitCredential}
>             return AccessController.doPrivileged(
>                 new PrivilegedExceptionAction<KerberosTicket>() {
>                 public KerberosTicket run() throws Exception {
>                     // It's OK to use null as serverPrincipal. TGT is almost
>                     // the first ticket for a principal and we use list.
>                     return Krb5Util.getTicket(
>                         realCaller,
>                         clientPrincipal, null, acc);
>                         }});
> {code}
> So here, the library will use the service ticket as TGT to acquire a service ticket,
and KDC will reject the request since the 'TGT' does not start with 'krbtgt'. And it can never
recover because in UGI, the re-login will check if there is a valid TGT first and no doubt,
we have one...
> This usually happens when a secure connection initialization comes along with the re-login,
and the end time indicates that the service ticket is acquired by the previous TGT. Since
UGI does not prevent doAs and re-login happen at the same time, we believe that there is a
race condition.
> After reading the code, we found a possible race condition.
> See http://hg.openjdk.java.net/jdk8u/jdk8u60/jdk/file/935758609767/src/share/classes/sun/security/jgss/krb5/Krb5Context.java,
the {{initSecContext}} method, we will get TGT first, then check if there is already a service
ticket, if not, acquire a service ticket using the TGT, and put it into the credentials.
> And in Krb5LoginModule.logout(the sun version), we will remove the kerberos tickets from
the credentials first, and then destroy them.
> Here comes the race condition. Let T1 be the secure connection set up thread, T2 be the
re-login thread.
> T1: get TGT
> T2: remove all tickets from credentials
> T1: check service ticket, none(since all tickets have been removed)
> T1: acquire a new service ticket using TGT and put it into the credentials
> T2: destroy all tickets
> T2: login, i.e., put a new TGT into the credentials.
> It is hard to write a UT to produce the problem because the racing code is in jdk, which
is not written by us...
> Suggestions are welcomed. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org


Mime
View raw message