accumulo-user mailing list archives

From Josh Elser <els...@apache.org>
Subject Re: Kerberos ticket renewal
Date Thu, 13 Jul 2017 17:28:05 GMT
Aha! That's an interesting wrinkle :)

I have more experience with NiFi's use of Kerberos than I care to admit 
(due to some folks who work in the physical office I do); I'm not aware 
of anything that NiFi does which would cause problems, but that may be a 
relevant detail.

After I thought about it some more (to your #2 point): there's a little 
failsafe in the Accumulo client implementation which, upon a SASL 
authentication failure, attempts a relogin via Kerberos. This should 
"catch" the cases where your client application is using a ticket cache 
(because the conventional ticket cache location lets the jGSS client 
library in Java itself do the relogin, whereas Java doesn't know which 
keytab to use). Still, though, a thread as you describe in #1 should 
have an equivalent net effect.
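
A minimal sketch of the kind of renewer thread described in #1, in case it
helps to compare against. The class name, the `Relogin` hook, and the period
are all hypothetical and not taken from Accumulo or Hadoop; the hook stands
in for whatever performs the actual relogin, so the block compiles without
Hadoop on the classpath:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs a relogin hook on a fixed delay from one daemon thread.
final class KeytabReloginScheduler {

    /** Hypothetical hook wrapping the real relogin call, which may throw. */
    public interface Relogin {
        void run() throws Exception;
    }

    // Single daemon thread, so the renewer never keeps the JVM alive on its own.
    private final ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "keytab-relogin");
            t.setDaemon(true);
            return t;
        });

    /** Invokes the hook every {@code period}, starting one period from now. */
    public ScheduledFuture<?> start(Relogin relogin, long period, TimeUnit unit) {
        return executor.scheduleWithFixedDelay(() -> {
            try {
                relogin.run();
            } catch (Exception e) {
                // An exception escaping a scheduled task cancels all later runs,
                // so log it and let the next attempt happen.
                System.err.println("Kerberos relogin attempt failed: " + e);
            }
        }, period, period, unit);
    }

    /** Stops the renewer thread. */
    public void stop() {
        executor.shutdownNow();
    }
}
```

With Hadoop available, the hook would presumably be something like
`() -> UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab()`,
which is what the SecurityUtil code linked further down the thread does. As
far as I know that call only re-logs in when the TGT is close to expiry, so
a short period (a minute, say) should be cheap. If several connections share
the same keytab and login user, one scheduler for the whole process is
probably enough rather than one per connection.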

On 7/13/17 11:45 AM, James Srinivasan wrote:
> Thanks, just checked that and it does seem renewable (tested using
> kinit -R). I'm running my code in two separate scenarios:
> 
> 1) As part of a NiFi processor, which currently makes multiple
> Accumulo connections using the same keytab, each of which currently
> has a separate renewer thread
> 2) As part of a simple command line application - this seems to have
> no problem running for > 10 hours (even before I added the periodic
> renewal code)
> 
> Will add extra logging to #2 and try to shorten the expiry from 10
> hours to 1 so I can see any difference in output.
> 
> James
> 
> On 13 July 2017 at 16:05, Josh Elser <elserj@apache.org> wrote:
>> It also may be worth mentioning to check the principal's configuration that
>> you're using in your client. Depending on which you're using and how it was
>> created, it may not actually support renewals.
>>
>> A quick test is to just `kinit` and then `kinit -R`. You can view the
>> explicit "configuration" for a principal using the `kadmin` console and the
>> `getprinc <principal>` command. Be sure to check the krbtgt/<REALM>
>> principal as well:
>>
>> e.g.
>>
>> kadmin.local:  getprinc jelser
>> Principal: jelser@EXAMPLE.COM
>> Maximum ticket life: 1 day 00:00:00
>> Maximum renewable life: 7 days 00:00:00
>>
>> kadmin.local:  getprinc krbtgt/EXAMPLE.COM
>> Principal: krbtgt/EXAMPLE.COM@EXAMPLE.COM
>> Maximum ticket life: 1 day 00:00:00
>> Maximum renewable life: 7 days 00:00:00
>>
>> If the krbtgt/$REALM principal does not have a non-zero renewable lifetime,
>> any other principals created in that realm would also not be allowed to be
>> renewed. Since you have the working "service" principals, you can
>> cross-check those.
>>
>> On 7/13/17 10:56 AM, James Srinivasan wrote:
>>>
>>> Yup, I am indeed on HDP - thanks for the link. The services do log GSS
>>> exceptions every ten hours, but seem to sufficiently recover
>>> themselves. Having turned up logging on my client:
>>>
>>> 1) On client start, I see hadoop login messages
>>> 2) After 8 hours (0.8*10 hours) when the renewal is expected to take
>>> place, I don't see any hadoop login messages
>>> 3) After 10 hours, I see GSS exceptions
>>> 4) After each GSS exception, I see an attempt to renew but using
>>> ticket cache, rather than keytab.
>>>
>>> Currently working on shortening the 10 hour expiry time so I can catch
>>> it in a debugger!
>>>
>>> Thanks,
>>>
>>> James
>>>
>>>
>>> On 13 July 2017 at 15:20, Josh Elser <elserj@apache.org> wrote:
>>>>
>>>> If you're using Hortonworks' HDP, you would probably benefit from
>>>> https://github.com/hortonworks/accumulo
>>>>
>>>> There is likely a git-tag for the exact version that you're running. The
>>>> line numbers would match there.
>>>>
>>>> To be clear, if your services (e.g. TabletServers) aren't failing after
>>>> 10hrs, you're not running into ACCUMULO-4069. Given my (limited)
>>>> understanding, your problem is purely client-side. It's possible that the
>>>> client-side RPC implementation isn't correctly handling the ticket
>>>> re-login, but I know there is specifically code in there to handle the
>>>> re-login case.
>>>>
>>>> The next step would be getting some debug logging from your application
>>>> around UserGroupInformation or the JDK itself, or just spin up a trivial
>>>> example with a small relogin window to reproduce the problem.
>>>>
>>>> On 7/12/17 3:48 PM, James Srinivasan wrote:
>>>>>
>>>>>
>>>>> Yup, I'm going to spin up a vanilla 1.7.0 (maybe newer) install too to
>>>>> see if it behaves any differently. There is at least one patch
>>>>> included in their distro that isn't in the formal documentation, plus
>>>>> it makes matching line numbers in logs to src code rather difficult.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> James
>>>>>
>>>>> On 12 July 2017 at 20:37, Sean Busbey <busbey@cloudera.com> wrote:
>>>>>>
>>>>>>
>>>>>> Hi James!
>>>>>>
>>>>>> It sounds like you may need to chase things down with your vendor,
>>>>>> since the precise combination of patches included will make looking
>>>>>> at things hard for the community.
>>>>>>
>>>>>> On Wed, Jul 12, 2017 at 11:01 AM, James Srinivasan
>>>>>> <james.srinivasan@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> So I've fired off a thread to perform the periodic
>>>>>>> checkTGTAndReloginFromKeytab call which seems to be running, but the
>>>>>>> connection still fails with GSS errors after precisely 10 hours.
>>>>>>>
>>>>>>> While I am running 1.7.0, it seems the vendor included the
>>>>>>> ACCUMULO-4069 patch, and immediately after the exception is thrown I
>>>>>>> see a log entry "Performing ticket-cache-based Kerberos re-login".
>>>>>>> However, it should be using a keytab - have turned up the logging to
>>>>>>> 11 and will leave running overnight...
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>> On 11 July 2017 at 16:17, Josh Elser <josh.elser@gmail.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Nope, you've got it exactly right! That's the code I would've pointed
>>>>>>>> you at to copy :)
>>>>>>>>
>>>>>>>> If/when you do get to long-running MR jobs, see the
>>>>>>>> "general.delegation.token.*" configuration properties in this
>>>>>>>> table[1]. I think the docs are citing that one delegation token is
>>>>>>>> valid for 7 days, but it's been a long time since writing/testing
>>>>>>>> that code.
>>>>>>>>
>>>>>>>> - Josh
>>>>>>>>
>>>>>>>> [1] https://accumulo.apache.org/1.8/accumulo_user_manual.html#_server_configuration_2
>>>>>>>>
>>>>>>>> On 7/11/17 1:25 AM, James Srinivasan wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks both. I can't (easily) upgrade beyond 1.7.0, but have raised
>>>>>>>>> a support case with our Hadoop distribution vendor.
>>>>>>>>>
>>>>>>>>> I'm not (yet) worried about expiration with MapReduce - for now I'll
>>>>>>>>> try to keep such jobs to under 24h! Outside MR, sounds like I just
>>>>>>>>> need to periodically call
>>>>>>>>> UserGroupInformation.checkTGTAndReloginFromKeytab like
>>>>>>>>>
>>>>>>>>> https://github.com/apache/accumulo/blob/master/server/base/src/main/java/org/apache/accumulo/server/security/SecurityUtil.java#L121
>>>>>>>>>
>>>>>>>>> Or is the TGT associated with an Accumulo KerberosToken separate?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> James
>>>>>>>>>
>>>>>>>>> On 11 July 2017 at 02:59, Josh Elser <josh.elser@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> No, you are (likely) not running into ACCUMULO-4069. What you've
>>>>>>>>>> described sounds like your client's ticket expired. Accumulo does
>>>>>>>>>> not spawn any ticket renewal on the behalf of clients.
>>>>>>>>>>
>>>>>>>>>> Hadoop's UGI code will automatically spawn a renewal thread when
>>>>>>>>>> you log in using a ticket cache. This does not happen automatically
>>>>>>>>>> when you use a keytab (I have no explanation as to why this is).
>>>>>>>>>> This is the most likely cause of your error and something you need
>>>>>>>>>> to correct in your application (spawn a thread to renew your
>>>>>>>>>> application's ticket).
>>>>>>>>>>
>>>>>>>>>> If you are using MapReduce, you have yet another layer of
>>>>>>>>>> indirection with DelegationTokens, but that's probably not what
>>>>>>>>>> you're seeing (as DelegationTokens don't actually have a Kerberos
>>>>>>>>>> TGT).
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 10, 2017 at 5:42 PM, Christopher <ctubbsii@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> It certainly sounds like the same issue. I'd recommend upgrading
>>>>>>>>>>> to the latest 1.7.3 (currently the latest 1.7 version) to include
>>>>>>>>>>> all the bugs we've found and fixed in that release line.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jul 10, 2017 at 5:50 AM James Srinivasan
>>>>>>>>>>> <james.srinivasan@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I'm using Accumulo 1.7.0 and finding that after some period of
>>>>>>>>>>>> time (>8 hours, <3 days - happened over the weekend) my ingest
>>>>>>>>>>>> fails with errors regarding "Failed to find any Kerberos tgt". My
>>>>>>>>>>>> guess is that the ticket from the keytab has expired, and needs to
>>>>>>>>>>>> be renewed - from memory, I had seen a Kerberos tgt renewer thread
>>>>>>>>>>>> running in my client, so assumed it happened automagically. Is
>>>>>>>>>>>> that the case? Perhaps I am hitting this bug?
>>>>>>>>>>>> https://issues.apache.org/jira/browse/ACCUMULO-4069
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> James
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> busbey
