hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-15378) Hadoop client unable to relogin because a remote DataNode has an incorrect krb5.conf
Date Tue, 10 Apr 2018 16:12:00 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-15378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Wei-Chiu Chuang updated HADOOP-15378:
-------------------------------------
    Affects Version/s: 2.6.0

> Hadoop client unable to relogin because a remote DataNode has an incorrect krb5.conf
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-15378
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15378
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: security
>    Affects Versions: 2.6.0
>         Environment: CDH5.8.3, Kerberized, Impala
>            Reporter: Wei-Chiu Chuang
>            Priority: Critical
>
> This is a very weird bug.
> We received a report where a Hadoop client (Impala Catalog server) failed to relogin
and crashed every several hours. Initial indication suggested the symptom matched HADOOP-13433.
> But after we patched HADOOP-13433 (as well as HADOOP-15143), Impala Catalog server still
kept crashing.
>  
> {noformat}
> W0114 05:49:24.676743 41444 UserGroupInformation.java:1838] PriviledgedActionException
as:impala/host1.example.com@EXAMPLE.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException):
Failure to initialize security context
> W0114 05:49:24.680363 41444 UserGroupInformation.java:1137] The first kerberos ticket
is not TGT(the server principal is hdfs/host2.example.com@EXAMPLE.COM), remove and destroy
it.
> W0114 05:49:24.680501 41444 UserGroupInformation.java:1137] The first kerberos ticket
is not TGT(the server principal is hdfs/host3.example.com@EXAMPLE.COM), remove and destroy
it.
> W0114 05:49:24.680593 41444 UserGroupInformation.java:1153] Warning, no kerberos ticket
found while attempting to renew ticket{noformat}
> The error “Failure to initialize security context” is suspicious here. Catalogd was
unable to log in because of a Kerberos issue. The JDK expects the first kerberos ticket of
a principal to be a TGT, however it seems that after this error, because it was unable to
login successfully, the first ticket was no longer a TGT. The patch HADOOP-13433 removed other
tickets of the principal, because it expects the TGT to be in the principal’s ticket, which
is untrue in this case. So finally, it removed all tickets.
> And then
> {noformat}
> W0114 05:49:24.681946 41443 UserGroupInformation.java:1838] PriviledgedActionException
as:impala/host1.example.com@EXAMPLE.COM (auth:KERBEROS) cause:javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level:
Failed to find any Kerberos tgt)]
> {noformat}
> The error “Failed to find any Kerberos tgt” is typically an indication that the user’s
Kerberos ticket has expired. However, that’s definitely not the case here, since it was
just a little over 8 hours.
> After we patched HADOOP-13433, the error handling code exhibited NPE, as reported in HADOOP-15143.
>  
> {code:java}
> I0114 05:50:26.758565 6384 RetryInvocationHandler.java:148] Exception while invoking
listCachePools of class ClientNamenodeProtocolTranslatorPB over host4.example.com/10.0.121.66:8020
after 2 fail over attempts. Trying to fail over immediately. Java exception follows: java.io.IOException:
Failed on local exception: java.io.IOException: Couldn't set up IO streams; Host Details :
local host is: "host1.example.com/10.0.121.45"; destination host is: "host4.example.com":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Client.call(Client.java:1506)
at org.apache.hadoop.ipc.Client.call(Client.java:1439) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy9.listCachePools(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.listCachePools(ClientNamenodeProtocolTranslatorPB.java:1261)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy10.listCachePools(Unknown Source) at org.apache.hadoop.hdfs.protocol.CachePoolIterator.makeRequest(CachePoolIterator.java:55)
at org.apache.hadoop.hdfs.protocol.CachePoolIterator.makeRequest(CachePoolIterator.java:33)
at org.apache.hadoop.fs.BatchedRemoteIterator.makeRequest(BatchedRemoteIterator.java:77) at
org.apache.hadoop.fs.BatchedRemoteIterator.makeRequestIfNeeded(BatchedRemoteIterator.java:85)
at org.apache.hadoop.fs.BatchedRemoteIterator.hasNext(BatchedRemoteIterator.java:99) at com.cloudera.impala.catalog.CatalogServiceCatalog$CachePoolReader.run(CatalogServiceCatalog.java:193)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Couldn't set up IO
streams at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:826) at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:396)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1555) at org.apache.hadoop.ipc.Client.call(Client.java:1478)
... 23 more Caused by: java.lang.NullPointerException at org.apache.hadoop.security.UserGroupInformation.fixKerberosTicketOrder(UserGroupInformation.java:1136)
at org.apache.hadoop.security.UserGroupInformation.reloginFromTicketCache(UserGroupInformation.java:1272)
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:697) at java.security.AccessController.doPrivileged(Native
Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1835)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:681) at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:769) ... 26 more
> {code}
>  
> In any case, HADOOP-15143 does not fix the problem, since valid credentials were already
removed from the user.
> Also, even after HADOOP-13433 was applied, I still saw the following error:
> {noformat}
> W0113 14:33:44.254727 1255277 UserGroupInformation.java:1838] PriviledgedActionException
as:impala/host1.example.com@EXAMPLE.COM (auth:KERBEROS) cause:javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level:
The ticket isn't for us (35) - BAD TGS SERVER NAME)]
> {noformat}
>  
> We traced the SaslException back to a particular remote DataNode. Finally, it turns out
that the DataNode's host has an incorrect krb5.conf, which points to a decommissioned KDC,
inconsistent with the rest of the hosts in the cluster. Upon authentication, it seems the
error handling for the SaslException has a race condition, and occasionally it removes valid
credential from the UGI. When we corrected the krb5.conf on that host, this error disappeared.
> This is not easy to reproduce, but we are told this issue seems to occur after Impala
user performs a "refresh" command. Presumably this command forces Impala catalog server to
connect to the problematic DataNode and triggered the buggy code.
>  
> I don't have a patch to fix it but I want to raise this issue for discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org


Mime
View raw message