accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-4359) Accumulo client stuck in infinite loop when Kerberos ticket expires
Date Thu, 07 Jul 2016 21:28:11 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366804#comment-15366804 ]

Josh Elser commented on ACCUMULO-4359:
--------------------------------------

bq. I'm not sure about the best way to fix it, advice is welcome, but I'm thinking that a
binary exponential backoff (maybe capped at 30s?) instead of a retry every 100ms would at
least lighten the load on the tservers?

Yes, I completely agree with you on the timeout/retry logic.

It's also very difficult to distinguish RPC-level exceptions that are retryable from those
that are fatal.

Both of these would be a great place for improvements.
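For the retry piece, here is a minimal sketch of what a capped binary exponential backoff could look like. The class and constant names are illustrative only, not the actual Accumulo retry API; the 100ms start and 30s cap come from the numbers discussed in this issue.

```java
// Illustrative sketch of capped binary exponential backoff for a retry
// loop like the one in ServerClient. Names here are hypothetical.
public class RetryBackoff {
    private static final long INITIAL_WAIT_MS = 100;   // the current fixed retry interval
    private static final long MAX_WAIT_MS = 30_000;    // the proposed 30s cap

    /**
     * Wait time before the given (0-based) retry attempt:
     * 100ms, 200ms, 400ms, ... doubling until capped at 30s.
     */
    public static long waitMillis(int attempt) {
        // Clamp the exponent first so the shift can't overflow for large attempts.
        long wait = INITIAL_WAIT_MS << Math.min(attempt, 20);
        return Math.min(wait, MAX_WAIT_MS);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 12; i++) {
            System.out.println("attempt " + i + ": sleep " + waitMillis(i) + "ms");
        }
    }
}
```

The retry loop would sleep for {{waitMillis(attempt)}} between attempts instead of a flat 100ms, which keeps the tserver-side failure log from filling up ten times a second.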

bq. I know there were some issues with older Hadoop versions... perhaps you need to update
to 2.6.4 or later?

Nah, he's saying that they didn't launch a renewal thread for their Kerberos ticket. So, after
their mapreduce job ran, they had an invalid ticket (it would be expected that they couldn't
make an RPC). We just didn't fail when this happened, but sat in a tight loop, retrying rapidly
on the failures.
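The missing renewal thread can be sketched generically as a daemon that periodically re-logs in. In a real Hadoop client the scheduled task would call {{UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab()}}; the {{TicketRenewer}} class and the period below are hypothetical stand-ins, not anything Accumulo ships.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a Kerberos ticket renewal thread. The caller
// supplies the re-login action (e.g. a lambda wrapping Hadoop's
// UserGroupInformation checkTGTAndReloginFromKeytab call).
public class TicketRenewer {
    public static ScheduledExecutorService start(Runnable relogin, long periodMs) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "kerberos-ticket-renewer");
            t.setDaemon(true); // renewals alone shouldn't keep the JVM alive
            return t;
        });
        // Run the re-login action on a fixed period for the life of the client.
        exec.scheduleAtFixedRate(relogin, periodMs, periodMs, TimeUnit.MILLISECONDS);
        return exec;
    }
}
```

With something like this running for the lifetime of the job, the ticket would still be valid when the cleanup phase makes its RPCs.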

> Accumulo client stuck in infinite loop when Kerberos ticket expires
> -------------------------------------------------------------------
>
>                 Key: ACCUMULO-4359
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4359
>             Project: Accumulo
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.7.2
>         Environment: Problem only exists when Kerberos is turned on.
>            Reporter: Russ Weeks
>            Assignee: Russ Weeks
>            Priority: Minor
>             Fix For: 1.8.0
>
>
> If an Accumulo client tries to send an RPC to a tserver but the client's token is expired,
it will get stuck in an infinite loop [here|https://github.com/apache/accumulo/blob/1.7/core/src/main/java/org/apache/accumulo/core/client/impl/ServerClient.java#L102].
> I'm setting the priority to "minor" because it's actually pretty difficult to put the
system into this state: you have to create the client with a valid token, let the token expire,
and then try to use the client. We hit this by accident in the cleanup phase of a very long-running
MR job; the workaround (a.k.a. the right way to do it) is to create a new client instead of
re-using an old client.
> On the tserver side, we get an exception like this every 100ms:
> {noformat}
> java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Peer indicated
failure: GSS initiate failed
> 	at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
> 	at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:51)
> 	at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:48)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:360)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
> 	at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory.getTransport(UGIAssumingTransportFactory.java:48)
> 	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:208)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> On the client side, no output is produced unless debug logging is turned on for o.a.a.core.client.impl.ServerClient,
in which case you see a bunch of "Failed to find TGT" errors.
> I'm not sure about the best way to fix it, advice is welcome, but I'm thinking that a
binary exponential backoff (maybe capped at 30s?) instead of a retry every 100ms would at
least lighten the load on the tservers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
