accumulo-notifications mailing list archives

From "Christopher Tubbs (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-4359) Accumulo client stuck in infinite loop when Kerberos ticket expires
Date Thu, 07 Jul 2016 21:19:11 GMT


Christopher Tubbs commented on ACCUMULO-4359:

I know there were some issues with older Hadoop versions... perhaps you need to update to
2.6.4 or later? Maybe [~elserj] can confirm.

> Accumulo client stuck in infinite loop when Kerberos ticket expires
> -------------------------------------------------------------------
>                 Key: ACCUMULO-4359
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.7.2
>         Environment: Problem only exists when Kerberos is turned on.
>            Reporter: Russ Weeks
>            Assignee: Russ Weeks
>            Priority: Minor
>             Fix For: 1.8.0
> If an Accumulo client tries to send an RPC to a tserver but the client's token is expired, it will get stuck in an infinite loop [here|].
> I'm setting the priority to "minor" because it's actually pretty difficult to put the system into this state: you have to create the client with a valid token, let the token expire, and then try to use the client. We hit this by accident in the cleanup phase of a very long-running MR job; the workaround (a.k.a. the right way to do it) is to create a new client instead of re-using an old client.
> On the tserver side, we get an exception like this every 100ms:
> {noformat}
> java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Peer indicated failure: GSS initiate failed
> 	at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(
> 	at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$
> 	at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$
> 	at Method)
> 	at
> 	at
> 	at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory.getTransport(
> 	at org.apache.thrift.server.TThreadPoolServer$
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(
> 	at java.util.concurrent.ThreadPoolExecutor$
> 	at
> 	at
> {noformat}
> On the client side, no output is produced unless debug logging is turned on for o.a.a.core.client.impl.ServerClient, in which case you see a bunch of "Failed to find TGT" errors.
> I'm not sure about the best way to fix it (advice is welcome), but I'm thinking that a binary exponential backoff (maybe capped at 30s?) instead of a retry every 100ms would at least lighten the load on the tservers.
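
For illustration only, the capped backoff suggested in the description could look roughly like the sketch below. It is not taken from the Accumulo code base: the class name CappedBackoff and the method delayMillis are hypothetical, the 100ms starting delay and 30s cap are just the figures mentioned in the report, and it uses plain doubling without the random component that a full binary exponential backoff would normally include.

```java
/** Hypothetical helper, not part of Accumulo: a doubling retry delay with an upper cap. */
final class CappedBackoff {
    private CappedBackoff() {}

    /**
     * Delay to wait before retry number {@code attempt} (0-based):
     * initialMillis * 2^attempt, but never more than maxMillis.
     */
    static long delayMillis(long initialMillis, long maxMillis, int attempt) {
        if (initialMillis <= 0 || maxMillis < initialMillis || attempt < 0) {
            throw new IllegalArgumentException(
                "need 0 < initialMillis <= maxMillis and attempt >= 0");
        }
        if (attempt >= 62) {
            return maxMillis; // 2^attempt would no longer fit in a positive long
        }
        long factor = 1L << attempt;
        // Compare by division so the multiplication below cannot overflow.
        if (initialMillis > maxMillis / factor) {
            return maxMillis;
        }
        return initialMillis * factor;
    }
}
```

With delayMillis(100, 30_000, n), the waits would be 100ms, 200ms, 400ms and so on up to 25600ms at n = 8, and 30000ms for every retry after that, rather than a fixed 100ms between attempts.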

This message was sent by Atlassian JIRA
