accumulo-user mailing list archives

From Josh Elser <els...@apache.org>
Subject Re: Kerberos ticket renewal
Date Wed, 19 Jul 2017 16:13:16 GMT
Hah! Maybe a bug in the Hadoop client code then :)

Thanks for taking the time to post all of your findings to the list. 
This will be a very good thread for future people to refer to.

On 7/14/17 3:56 PM, James Srinivasan wrote:
> Hmm, so it seems updating the Hadoop version used by my processor from
> 2.6.0 to 2.7.3 has fixed the problem. Testing a little more just to
> make sure...
> 
> On 14 July 2017 at 13:39, James Srinivasan <james.srinivasan@gmail.com> wrote:
>> So when my code runs in a NiFi processor, the initial keytab
>> authentication works fine but following that it seems to think keytabs
>> aren't in use (UserGroupInformation.getCurrentUser.isFromKeytab is
>> false), which explains why the renewal code never actually runs and
>> why re-login is attempted using the ticket cache after the GSS
>> exception. Over to the NiFi list I think...
>>
>> Making some progress!
>>
>> On 13 July 2017 at 18:28, Josh Elser <elserj@apache.org> wrote:
>>> Aha! That's an interesting wrinkle :)
>>>
>>> I have more experience with NiFi's use of Kerberos than I care to admit (due
>>> to some folks who work in the physical office I do); I'm not aware of
>>> anything that NiFi does which would cause problems, but that may be a
>>> relevant detail.
>>>
>>> After I thought about it some more (to your #2 point): there's a little
>>> failsafe in the Accumulo client implementation that, upon a SASL
>>> authentication failure, attempts a relogin via Kerberos. This should
>>> "catch" the cases where your client application is using a ticket cache
>>> (because convention on the ticket cache location lets the jGSS client
>>> library in Java itself do the relogin, whereas Java doesn't know which keytab
>>> to use). Still though -- a thread as you describe in #1 should have an
>>> equivalent net effect.
>>>
>>> On 7/13/17 11:45 AM, James Srinivasan wrote:
>>>>
>>>> Thanks, just checked that and it does seem renewable (tested using
>>>> kinit -R). I'm running my code in two separate scenarios:
>>>>
>>>> 1) As part of a NiFi processor, which currently makes multiple
>>>> Accumulo connections using the same keytab, each of which currently
>>>> has a separate renewer thread
>>>> 2) As part of a simple command line application - this seems to have
>>>> no problem running for > 10 hours (even before I added the periodic
>>>> renewal code)
>>>>
>>>> Will add extra logging to #2 and try to shorten the expiry from 10
>>>> hours to 1 so I can see any difference in output.
>>>>
>>>> James
>>>>
>>>> On 13 July 2017 at 16:05, Josh Elser <elserj@apache.org> wrote:
>>>>>
>>>>> It may also be worth checking the configuration of the principal that
>>>>> you're using in your client. Depending on which one you're using and how
>>>>> it was created, it may not actually support renewals.
>>>>>
>>>>> A quick test is to just `kinit` and then `kinit -R`. You can view the
>>>>> explicit "configuration" for a principal using the `kadmin` console and
>>>>> the `getprinc <principal>` command. Be sure to check the krbtgt/<REALM>
>>>>> principal as well:
>>>>>
>>>>> e.g.
>>>>>
>>>>> kadmin.local:  getprinc jelser
>>>>> Principal: jelser@EXAMPLE.COM
>>>>> Maximum ticket life: 1 day 00:00:00
>>>>> Maximum renewable life: 7 days 00:00:00
>>>>>
>>>>> kadmin.local:  getprinc krbtgt/EXAMPLE.COM
>>>>> Principal: krbtgt/EXAMPLE.COM@EXAMPLE.COM
>>>>> Maximum ticket life: 1 day 00:00:00
>>>>> Maximum renewable life: 7 days 00:00:00
>>>>>
>>>>> If the krbtgt/$REALM principal does not have a non-zero renewable
>>>>> lifetime, any other principals created in that realm would also not be
>>>>> allowed to be renewed. Since you have the working "service" principals,
>>>>> you can cross-check those.
>>>>>
>>>>> On 7/13/17 10:56 AM, James Srinivasan wrote:
>>>>>>
>>>>>>
>>>>>> Yup, I am indeed on HDP - thanks for the link. The services do log GSS
>>>>>> exceptions every ten hours, but seem to sufficiently recover
>>>>>> themselves. Having turned up logging on my client:
>>>>>>
>>>>>> 1) On client start, I see hadoop login messages
>>>>>> 2) After 8 hours (0.8*10 hours) when the renewal is expected to take
>>>>>> place, I don't see any hadoop login messages
>>>>>> 3) After 10 hours, I see GSS exceptions
>>>>>> 4) After each GSS exception, I see an attempt to renew but using
>>>>>> ticket cache, rather than keytab.
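The 8-hour mark in step 2 comes from the "0.8*10 hours" arithmetic quoted above. As a rough sketch of that calculation only: `RenewalWindow`, `RENEW_FRACTION`, and `renewAfter` are hypothetical names chosen for illustration and are not Hadoop APIs, and the 0.8 fraction is taken from the message rather than read from any particular Hadoop version.

```java
import java.time.Duration;

class RenewalWindow {
    // Fraction of the ticket lifetime after which a renewal is expected.
    // Assumption: 0.8, as quoted in the message ("0.8*10 hours").
    static final double RENEW_FRACTION = 0.8;

    /** Returns how long after ticket issue a renewal attempt is expected. */
    static Duration renewAfter(Duration ticketLifetime) {
        return Duration.ofMillis((long) (ticketLifetime.toMillis() * RENEW_FRACTION));
    }

    public static void main(String[] args) {
        // A 10-hour ticket gives an expected renewal point of 8 hours.
        System.out.println(renewAfter(Duration.ofHours(10)));
    }
}
```

Shortening the ticket lifetime to 1 hour, as proposed later in the thread, would move the expected renewal point to 48 minutes under the same assumption.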
>>>>>>
>>>>>> Currently working on shortening the 10 hour expiry time so I can catch
>>>>>> it in a debugger!
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> James
>>>>>>
>>>>>>
>>>>>> On 13 July 2017 at 15:20, Josh Elser <elserj@apache.org> wrote:
>>>>>>>
>>>>>>>
>>>>>>> If you're using Hortonworks' HDP, you would probably benefit from
>>>>>>> https://github.com/hortonworks/accumulo
>>>>>>>
>>>>>>> There is likely a git-tag for the exact version that you're running.
>>>>>>> The line numbers would match there.
>>>>>>>
>>>>>>> To be clear, if your services (e.g. TabletServers) aren't failing after
>>>>>>> 10hrs, you're not running into ACCUMULO-4069. Given my (limited)
>>>>>>> understanding, your problem is purely client-side. It's possible that
>>>>>>> the client-side RPC implementation isn't correctly handling the ticket
>>>>>>> re-login, but I know there is specifically code in there to handle the
>>>>>>> re-login case.
>>>>>>>
>>>>>>> The next step would be getting some debug logging from your application
>>>>>>> around UserGroupInformation or the JDK itself, or just spin up a
>>>>>>> trivial example with a small relogin window to reproduce the problem.
>>>>>>>
>>>>>>> On 7/12/17 3:48 PM, James Srinivasan wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Yup, I'm going to spin up a vanilla 1.7.0 (maybe newer) install too to
>>>>>>>> see if it behaves any differently. There is at least one patch
>>>>>>>> included in their distro that isn't in the formal documentation, plus
>>>>>>>> it makes matching line numbers in logs to src code rather difficult.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> James
>>>>>>>>
>>>>>>>> On 12 July 2017 at 20:37, Sean Busbey <busbey@cloudera.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi James!
>>>>>>>>>
>>>>>>>>> It sounds like you may need to chase things down with your vendor,
>>>>>>>>> since the precise combination of patches included will make looking
>>>>>>>>> at things hard for the community.
>>>>>>>>>
>>>>>>>>> On Wed, Jul 12, 2017 at 11:01 AM, James Srinivasan
>>>>>>>>> <james.srinivasan@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> So I've fired off a thread to perform the periodic
>>>>>>>>>> checkTGTAndReloginFromKeytab call which seems to be running, but the
>>>>>>>>>> connection still fails with GSS errors after precisely 10 hours.
>>>>>>>>>>
>>>>>>>>>> While I am running 1.7.0, it seems the vendor included the
>>>>>>>>>> ACCUMULO-4069 patch, and immediately after the exception is thrown I
>>>>>>>>>> see a log entry "Performing ticket-cache-based Kerberos re-login".
>>>>>>>>>> However, it should be using a keytab - have turned up the logging to
>>>>>>>>>> 11 and will leave running overnight...
>>>>>>>>>>
>>>>>>>>>> James
>>>>>>>>>>
>>>>>>>>>> On 11 July 2017 at 16:17, Josh Elser <josh.elser@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Nope, you've got it exactly right! That's the code I would've
>>>>>>>>>>> pointed you at to copy :)
>>>>>>>>>>>
>>>>>>>>>>> If/when you do get to long-running MR jobs, see the
>>>>>>>>>>> "general.delegation.token.*" configuration properties in this
>>>>>>>>>>> table[1]. I think the docs are citing that one delegation token is
>>>>>>>>>>> valid for 7 days, but it's been a long time since writing/testing
>>>>>>>>>>> that code.
>>>>>>>>>>> - Josh
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://accumulo.apache.org/1.8/accumulo_user_manual.html#_server_configuration_2
>>>>>>>>>>>
>>>>>>>>>>> On 7/11/17 1:25 AM, James Srinivasan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks both. I can't (easily) upgrade beyond 1.7.0, but have
>>>>>>>>>>>> raised a support case with our Hadoop distribution vendor.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not (yet) worried about expiration with MapReduce - for now
>>>>>>>>>>>> I'll try to keep such jobs to under 24h! Outside MR, sounds like I
>>>>>>>>>>>> just need to periodically call
>>>>>>>>>>>> UserGroupInformation.checkTGTAndReloginFromKeytab like
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/apache/accumulo/blob/master/server/base/src/main/java/org/apache/accumulo/server/security/SecurityUtil.java#L121
>>>>>>>>>>>>
>>>>>>>>>>>> Or is the TGT associated with an Accumulo KerberosToken separate?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> James
>>>>>>>>>>>>
>>>>>>>>>>>> On 11 July 2017 at 02:59, Josh Elser <josh.elser@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> No, you are (likely) not running into ACCUMULO-4069. What you've
>>>>>>>>>>>>> described sounds like your client's ticket expired. Accumulo does
>>>>>>>>>>>>> not spawn any ticket renewal on the behalf of clients.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hadoop's UGI code will automatically spawn a renewal thread when
>>>>>>>>>>>>> you log in using a ticket cache. This does not happen automatically
>>>>>>>>>>>>> when you use a keytab (I have no explanation as to why this is). This
>>>>>>>>>>>>> is the most likely cause of your error and something you need to
>>>>>>>>>>>>> correct in your application (spawn a thread to renew your
>>>>>>>>>>>>> application's ticket).
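A minimal sketch of such a renewal thread, using only the JDK so that it stands alone. `KeytabReloginScheduler`, `Relogin`, and `start` are hypothetical names for illustration. The relogin action is passed in by the caller; with Hadoop on the classpath it would presumably be the `checkTGTAndReloginFromKeytab` call discussed in this thread, invoked on the logged-in `UserGroupInformation`. The period is the caller's choice and is not prescribed by anything in the thread.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class KeytabReloginScheduler {
    /** Relogin action; allowed to throw checked exceptions such as IOException. */
    interface Relogin {
        void run() throws Exception;
    }

    /**
     * Starts a single daemon thread that invokes the relogin action repeatedly.
     * Assumption: with Hadoop available, the action would be something like
     *   () -> UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab()
     */
    static ScheduledExecutorService start(Relogin relogin, long period, TimeUnit unit) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "kerberos-relogin");
            t.setDaemon(true); // do not keep the JVM alive just for renewal
            return t;
        });
        executor.scheduleWithFixedDelay(() -> {
            try {
                relogin.run();
            } catch (Exception e) {
                // Catch and report: an exception escaping this task would
                // silently cancel every later run.
                System.err.println("Kerberos relogin attempt failed: " + e);
            }
        }, period, period, unit);
        return executor;
    }
}
```

The executor is returned so the application can call `shutdownNow()` when it closes its connections. One shared scheduler per JVM is likely enough, since the login is process-wide rather than per connection.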
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you are using MapReduce, you have yet another layer of
>>>>>>>>>>>>> indirection with DelegationTokens, but that's probably not what
>>>>>>>>>>>>> you're seeing (as DelegationTokens don't actually have a Kerberos
>>>>>>>>>>>>> TGT).
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jul 10, 2017 at 5:42 PM, Christopher <ctubbsii@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It certainly sounds like the same issue. I'd recommend upgrading
>>>>>>>>>>>>>> to the latest 1.7.3 (currently the latest 1.7 version) to include
>>>>>>>>>>>>>> all the bugs we've found and fixed in that release line.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Jul 10, 2017 at 5:50 AM James Srinivasan <james.srinivasan@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm using Accumulo 1.7.0 and finding that after some period of
>>>>>>>>>>>>>>> time (>8 hours, <3 days - happened over the weekend) my ingest
>>>>>>>>>>>>>>> fails with errors regarding "Failed to find any Kerberos tgt". My
>>>>>>>>>>>>>>> guess is that the ticket from the keytab has expired, and needs to
>>>>>>>>>>>>>>> be renewed - from memory, I had seen a Kerberos tgt renewer thread
>>>>>>>>>>>>>>> running in my client, so assumed it happened automagically. Is
>>>>>>>>>>>>>>> that the case? Perhaps I am hitting this bug?
>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/ACCUMULO-4069
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> James
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> busbey
