accumulo-notifications mailing list archives

From "Russ Weeks (JIRA)" <>
Subject [jira] [Created] (ACCUMULO-4359) Accumulo client stuck in infinite loop when Kerberos ticket expires
Date Wed, 06 Jul 2016 16:10:11 GMT
Russ Weeks created ACCUMULO-4359:

             Summary: Accumulo client stuck in infinite loop when Kerberos ticket expires
                 Key: ACCUMULO-4359
             Project: Accumulo
          Issue Type: Bug
          Components: core
    Affects Versions: 1.7.2
         Environment: Problem only exists when Kerberos is turned on.
            Reporter: Russ Weeks
            Assignee: Russ Weeks
            Priority: Minor
             Fix For: 1.8.0

If an Accumulo client tries to send an RPC to a tserver but the client's token is expired,
it will get stuck in an infinite loop [here|].

I'm setting the priority to "minor" because it's actually pretty difficult to put the system
into this state: you have to create the client with a valid token, let the token expire, and
then try to use the client. We hit this by accident in the cleanup phase of a very long-running
MR job; the workaround (a.k.a. the right way to do it) is to create a new client instead of
re-using an old client.

On the tserver side, we get an exception like this every 100ms:
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Peer indicated
failure: GSS initiate failed
	at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(
	at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$
	at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$
	at Method)
	at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory.getTransport(
	at org.apache.thrift.server.TThreadPoolServer$
	at java.util.concurrent.ThreadPoolExecutor.runWorker(
	at java.util.concurrent.ThreadPoolExecutor$

On the client side, no output is produced unless debug logging is turned on for o.a.a.core.client.impl.ServerClient,
in which case you see a bunch of "Failed to find TGT" errors.

I'm not sure about the best way to fix it, and advice is welcome. I'm thinking that a binary
exponential backoff (maybe capped at 30s?) instead of a retry every 100ms would at least lighten
the load on the tservers.
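
To make the backoff idea concrete, here is a minimal sketch of a capped binary exponential backoff. It is not taken from the Accumulo code base: the class name BackoffRetry, the methods delayMillis and callWithBackoff, and the maxAttempts bound are all hypothetical and used only for illustration. The 100ms starting delay and the 30s cap come from the numbers mentioned above.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper illustrating capped binary exponential backoff. */
public final class BackoffRetry {
    // Starting delay matches the 100ms retry interval observed on the tserver.
    static final long INITIAL_DELAY_MS = 100L;
    // Cap suggested in the issue description ("maybe capped at 30s?").
    static final long MAX_DELAY_MS = 30_000L;

    private BackoffRetry() {}

    /**
     * Delay to wait after failed attempt number `attempt` (0-based):
     * 100ms, 200ms, 400ms, ... doubling each time and capped at 30s.
     */
    static long delayMillis(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        // Clamp the shift so the left shift cannot overflow a long.
        int shift = Math.min(attempt, 20);
        return Math.min(MAX_DELAY_MS, INITIAL_DELAY_MS << shift);
    }

    /**
     * Runs `op`, sleeping with capped exponential backoff between failures.
     * Bounding the number of attempts is an assumption of this sketch; it
     * rethrows the last failure instead of looping forever.
     */
    static <T> T callWithBackoff(Callable<T> op, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                // Only sleep if another attempt will follow.
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(delayMillis(attempt));
                }
            }
        }
        throw last;
    }
}
```

With these constants the delay reaches the 30s cap on the tenth failure (attempt index 9), so a client holding an expired ticket would hit the tserver roughly twice a minute rather than ten times a second.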
