kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ron Dagostino <rndg...@gmail.com>
Subject Re: [DISCUSS] KIP 368: Allow SASL Connections to Periodically Re-Authenticate
Date Mon, 17 Sep 2018 18:45:43 GMT
<<<extra field alongside errors, not in the opaque body as mentioned in the
KIP
Right, that was a concern I had mentioned in a follow-up email.  Agree it
should go alongside errors.

<<<SCRAM for delegation token based authentication where credentials have
<<<an expiry time, so we do need to be able to set individual credential
<<<lifetimes, where the server knows the expiry time but client may not
Ah, ok, I was not aware that this case existed.  Agreed, for consistency,
server will always send it back to the client.

<<<I was trying to avoid this altogether [SaslClient negotiated property
for lifetime]
Since we now agree that the server will send it back, yes, there is no need
for this.

<<<wasn't entirely clear about the purpose of
[sasl.login.refresh.reauthenticate.enable]
I think this might have been more useful when we weren't necessarily going
to support all SASL mechanisms and/or when the broker was not going to
advertise the fact that it supported re-authentication.  You are correct,
now that we support it for all SASL mechanisms and we are bumping an API
version, I think it is okay to simply enable it wherever both the client
and server meet the required versions.

<<<wasn't entirely clear about the purpose of [connections.max.reauth.ms]
<<<wasn't expecting that we would reject
<<<tokens or tickets simply because they were too long-lived
<<<tickets are granted by a 3rd party authority
<<<Client re-authenticates even though token was not
<<<refreshed. Does this matter?
I was going under the assumption that it would matter, but based on your
pushback I just realized that the same functionality can be implemented as
part of token validation if there is a desire to limit token lifetimes to a
certain max value (and the token validator has to be provided in production
anyway since all we provide out-of-the-box is the unsecured validator).  So
I'm willing to abandon this check as part of re-authentication.

I'll adjust the KIP accordingly a bit later.  Thanks for the continued
feedback/discussion.

Ron




On Mon, Sep 17, 2018 at 2:10 PM Rajini Sivaram <rajinisivaram@gmail.com>
wrote:

> Hi Ron,
>
>
> *1) Is there a reason to communicate this value back to the client when the
> client already knows it?  It's an extra network round-trip, and since the
> SASL round-tripsare defined by the spec I'm not certain adding an extra
> round-trip is acceptable.*
>
> I wasn't suggesting an extra round-trip for propagating session lifetime. I
> was expecting session lifetime to be added to the last SASL_AUTHENTICATE
> response from the broker. Because SASL is a challenge-response mechanism,
> SaslServer knows when the response being sent is the last one and hence can
> send the session lifetime in the response (in the same way as we propagate
> errors). I was expecting this to be added as an extra field alongside
> errors, not in the opaque body as mentioned in the KIP. The opaque byte
> array strictly conforms to the SASL mechanism wire protocol and we want to
> keep it that way.
>
> As you have said, we don't need server to propagate session lifetime for
> OAUTHBEARER since client knows the token lifetime. But server also knows
> the credential lifetime and by having the server decide the lifetime, we
> can use the same code path for all mechanisms. If we only have a constant
> max lifetime in the server, for PLAIN and SCRAM we will end up having the
> same lifetime for all credentials with no ability to set actual expiry. We
> use SCRAM for delegation token based authentication where credentials have
> an expiry time, so we do need to be able to set individual credential
> lifetimes, where the server knows the expiry time but client may not.
>
> *2) I also just realized that if the client is to learn the credential
> lifetime we wouldn't want to put special-case code in the Authenticator for
> GSSAPI and OAUTHBEARER; we would want to expose the value generically,
> probably as a negotiated property on the SaslClient instance.*
>
> I was trying to avoid this altogether. Client doesn't need to know
> credential lifetime. Server asks the client to re-authenticate within its
> session lifetime.
>
> 3) From the KIP, I wasn't entirely clear about the purpose of the two
> configs:
>
> sasl.login.refresh.reauthenticate.enable: Do we need this? Client knows if
> broker version supports re-authentication based on the SASL_AUTHENTICATE
> version returned in ApiVersionsResponse. Client knows if broker is
> configured to enable re-authentication based on session lifetime returned
> in SaslAuthenticateResponse. If broker has re-authentication configured and
> client supports re-authentication, you would always want re-authenticate.
> So I wasn't sure why we need a config to opt in or out on the client-side.
>
>
> connections.max.reauth.ms: We obviously need a broker-side config. Not
> entirely sure about the semantics of the config that drives
> re-authentication. In particular, I wasn't expecting that we would reject
> tokens or tickets simply because they were too long-lived. Since tokens or
> tickets are granted by a 3rd party authority, I am not sure if clients will
> always have control over the lifetime. Do we need to support any more
> scenarios than these:
>
>
> A) reauth.ms=10,credential.lifetime.ms=10 : Broker sets
> session.lifetime=10,
> so this works.
>
> B) reauth.ms=10, credential.lifetime.ms=5 : Broker sets
> session.lifetime=5,
> so this works.
> C) reauth.ms=10, credential.lifetime.ms=20 : Broker sets
> session.lifetime=10. Client re-authenticates even though token was not
> refreshed. Does this matter?
> D) reauth.ms=Long.MAX_VALUE, credential.lifetime.ms=10: Broker sets
> session.lifetime=10, client re-authenticates based on credential expiry.
> E) reauth.ms=0 (default), credential.lifetime.ms=10 : Broker sets
> session.lifetime=0, Broker doesn't terminate sessions, client doesn't
> re-authenticate. We generate useful metrics.
> F) reauth.ms=0 (default),no lifetime for credential (e.g. PLAIN): Broker
> sets session.lifetime=0, Broker doesn't terminate sessions, client doesn't
> re-authenticate
> G) reauth.ms=10,no lifetime for credential (e.g. PLAIN) : Broker sets
> session.lifetime=10. Client re-authenticates.
>
> I would have thought that D) is the typical scenario for OAuth/Kerberos to
> respect token expiry time. G) would be typical scenario for PLAIN to force
> re-authenication at regular intervals. A/B/C are useful to force
> re-authentication in scenarios where you might check for credential
> revocation in the server. And E/F are useful to disable re-authentication
> and generate metrics (also the default behaviour useful during migration).
> Have I missed something?
>
>
> On Mon, Sep 17, 2018 at 4:27 PM, Ron Dagostino <rndgstn@gmail.com> wrote:
>
> > Hi yet again, Rajini.  I also just realized that if the client is to
> learn
> > the credential lifetime we wouldn't want to put special-case code in the
> > Authenticator for GSSAPI and OAUTHBEARER; we would want to expose the
> value
> > generically, probably as a negotiated property on the SaslClient
> instance.
> > We might be talking about making the negotiated property key part of the
> > public API.  In other words, the SaslClient would be responsible for
> > exposing the credential (i.e. token or ticket) lifetime at a well-known
> > negotiated property name, such as "Credential.Lifetime" and putting a
> Long
> > value there if there is a lifetime.  That well-klnown key (e.g.
> > "Credential.Lifetime") would be part of the public API, right?
> >
> > Ron
> >
> > On Mon, Sep 17, 2018 at 11:03 AM Ron Dagostino <rndgstn@gmail.com>
> wrote:
> >
> > > Hi again, Rajini.  After thinking about this a little while, it occurs
> to
> > > me that maybe the communication of max session lifetime should occur
> via
> > > SASL_HANDSHAKE after all.  Here's why.  The value communicated is the
> max
> > > session lifetime allowed, and the client can assume it is the the
> session
> > > lifetime for that particular session unless the particular SASL
> mechanism
> > > could result in a shorter session that would be obvious to the client
> and
> > > the server.  In particular, for OAUTHBEARER, the session lifetime will
> be
> > > the token lifetime, which the client and server will both know.  Is
> > there a
> > > reason to communicate this value back to the client when the client
> > already
> > > knows it?  It's an extra network round-trip, and since the SASL
> > round-trips
> > > are defined by the spec I'm not certain adding an extra round-trip is
> > > acceptable.  Even if we decide we can add it, it helps with latency if
> we
> > > don't.
> > >
> > > Kerberos may be a bit different -- I don't know if the broker can learn
> > > the session lifetime.  If it can then the same thing holds -- both
> client
> > > and server will know the session lifetime and there is no reason to
> > > communicate it back.  If the server can't learn the lifetime then I
> don't
> > > think adding an extra SASL_AUTHENTICATE round trip is going to help,
> > anyway.
> > >
> > > Also, by communicating the max session lifetime in the SASL_HANDSHAKE
> > > response, both OAUTHBEARER and GSSAPI clients will be able to know
> before
> > > sending any SASL_AUTHENTICATE requests whether their credential
> violates
> > > the maximum.  This allows a behaving client to give a good error
> message.
> > > A malicious client would ignore the value and send a longer-lived
> > > credential, and then that would be rejected on the server side.
> > >
> > > I'm still good with ExpiringCredential not needing to be public.
> > >
> > > What do you think?
> > >
> > > Ron
> > >
> > > Ron
> > >
> > > Ron
> > >
> > > On Mon, Sep 17, 2018 at 9:13 AM Ron Dagostino <rndgstn@gmail.com>
> wrote:
> > >
> > >> Hi Rajini.  I've updated the KIP to reflect the decision to fully
> > support
> > >> this for all SASL mechanisms and to not require the ExpiringCredential
> > >> interface to be public.
> > >>
> > >> Ron
> > >>
> > >> On Mon, Sep 17, 2018 at 6:55 AM Ron Dagostino <rndgstn@gmail.com>
> > wrote:
> > >>
> > >>> Actually, I think the metric remains the same assuming the
> > >>> authentication fails when the token  lifetime is too long.  Negative
> > max
> > >>> config on server counts what would have been killed because maybe
> > client
> > >>> re-auth is not turned on; positive max enables the kill, which is
> > counted
> > >>> by a second metric as proposed.  The current proposal had already
> > stated
> > >>> that a non-zero value would disallow an authentication with a token
> > that
> > >>> has too large a lifetime, and that is still the case, I think.  But
> > let’s
> > >>> make sure we are on the same page about all this.
> > >>>
> > >>> Ron
> > >>>
> > >>> > On Sep 17, 2018, at 6:42 AM, Ron Dagostino <rndgstn@gmail.com>
> > wrote:
> > >>> >
> > >>> > Hi Rajini.  I see what you are saying.  The background login
> refresh
> > >>> thread does have to factor into the decision for OAUTHBEARER because
> > there
> > >>> is no new token to re-authenticate with until that refresh succeeds
> > >>> (similarly with Kerberos), but I think you are right that the
> interface
> > >>> doesn’t necessarily have to be public.  The server does decide the
> > time for
> > >>> PLAIN and the SCRAM-related mechanisms under the current proposal,
> but
> > it
> > >>> is done in SaslHandshakeResponse, and at that point the server won’t
> > have
> > >>> any token yet; making it happen via SaslAuthenticateRequest at the
> > very end
> > >>> does allow the server to know everything for all mechanisms.
> > >>> >
> > >>> > I see two potential issues to discuss.  First, the server must
> > >>> communicate a time that exceeds the (token or ticket) refresh period
> > on the
> > >>> client.  Assuming it communicates the expiration time of the
> > token/ticket,
> > >>> that’s the best it can do.  So I think that’ll be fine.
> > >>> >
> > >>> > The second issue is what happens if the server is configured to
> > accept
> > >>> a max value — say, an hour — and the token is good for longer.  I
> > assumed
> > >>> that the client should not be allowed to authenticate in the first
> > place
> > >>> because it could then simply re-authenticate with the same token
> after
> > an
> > >>> hour, which defeats the motivations for this KIP.  So do we agree the
> > max
> > >>> token lifetime allowed will be enforced at authentication time?
> > Assuming
> > >>> so, then we need to discuss migration.
> > >>> >
> > >>> > The current proposal includes a metric that can be used to identify
> > if
> > >>> an OAUTHBEARER client is misconfigured — the number of connections
> that
> > >>> would have been killed (could be non-zero when the configured max is
> a
> > >>> negative number). Do we still want this type of an indication, and if
> > so,
> > >>> is it still done the same way — a negative number for max, but
> instead
> > of
> > >>> counting the number of kills that would have been done it counts the
> > number
> > >>> of authentications that would have been failed?
> > >>> >
> > >>> > Ron
> > >>> >
> > >>> >> On Sep 17, 2018, at 6:06 AM, Rajini Sivaram <
> > rajinisivaram@gmail.com>
> > >>> wrote:
> > >>> >>
> > >>> >> Hi Ron,
> > >>> >>
> > >>> >> Thinking a bit more about other SASL mechanisms, I think the issue
> > is
> > >>> that
> > >>> >> in the current proposal, clients decide the re-authentication
> > period.
> > >>> For
> > >>> >> mechanisms like PLAIN or SCRAM, we would actually want server to
> > >>> determine
> > >>> >> the re-authentication interval. For example, the PLAIN or SCRAM
> > >>> database
> > >>> >> could specify the expiry time for each principal, or the broker
> > could
> > >>> be
> > >>> >> configured with a single expiry time for all principals of that
> > >>> mechanism.
> > >>> >> For OAuth, brokers do know the expiry time. Haven't figured out
> what
> > >>> to do
> > >>> >> with Kerberos, but in any case for broker-side connection
> > termination
> > >>> on
> > >>> >> expiry, we need the broker to know/decide the expiry time. So I
> > would
> > >>> like
> > >>> >> to suggest a slightly different approach to managing
> > >>> re-authentications.
> > >>> >>
> > >>> >> 1) Instead of changing SASL_HANDSHAKE version number, we bump up
> > >>> >> SASL_AUTHENTICATE version number.
> > >>> >> 2) In the final SASL_AUTHENTICATE response, broker returns the
> > expiry
> > >>> time
> > >>> >> of this authenticated session. This is the interval after which
> > >>> broker will
> > >>> >> terminate the connection If it wasn't re-authenticated.
> > >>> >> 3) We no longer need the public interface change to add `
> > >>> >> org.apache.kafka.common.security.expiring.ExpiringCredential` for
> > the
> > >>> >> client-side. Instead we schedule the next re-authentication on the
> > >>> client
> > >>> >> based on the expiry time provided by the server (some time earlier
> > >>> than the
> > >>> >> expiry).
> > >>> >> 4) If client uses SASL_AUTHENTICATE v0, broker will not return
> > expiry
> > >>> time.
> > >>> >> The connection will be terminated if that feature is enabled (the
> > >>> same code
> > >>> >> path as client failing to re-authenticte on time).
> > >>> >>
> > >>> >> Thoughts?
> > >>> >>
> > >>> >>> On Sat, Sep 15, 2018 at 3:06 AM, Ron Dagostino <
> rndgstn@gmail.com>
> > >>> wrote:
> > >>> >>>
> > >>> >>> Hi everyone.  I've updated the KIP to reflect all discussion to
> > date,
> > >>> >>> including the decision to go with the low-level approach.  This
> > >>> latest
> > >>> >>> version of the KIP includes the above "connections.max.reauth.ms
> "
> > >>> >>> proposal,
> > >>> >>> which I know has not been discussed yet.  It mentions new
> metrics,
> > >>> some of
> > >>> >>> which may not have been discussed either (and names are missing
> > from
> > >>> some
> > >>> >>> of them).  Regardless, this new version is the closest yet to a
> > >>> version
> > >>> >>> that can be put to a vote next week.
> > >>> >>>
> > >>> >>> Ron
> > >>> >>>
> > >>> >>>> On Fri, Sep 14, 2018 at 8:59 PM Ron Dagostino <
> rndgstn@gmail.com>
> > >>> wrote:
> > >>> >>>>
> > >>> >>>> Minor correction: I'm proposing "connections.max.reauth.ms" as
> > the
> > >>> >>>> broker-side configuration property, not "
> > >>> >>>> connections.max.expired.credentials.ms".
> > >>> >>>>
> > >>> >>>> Ron
> > >>> >>>>
> > >>> >>>>> On Fri, Sep 14, 2018 at 8:40 PM Ron Dagostino <
> rndgstn@gmail.com
> > >
> > >>> wrote:
> > >>> >>>>>
> > >>> >>>>> Hi Rajini.  I'm going to choose *
> > >>> connections.max.expired.credentials.ms
> > >>> >>>>> <http://connections.max.expired.credentials.ms>* as the option
> > >>> name
> > >>> >>>>> because it is consistent with the comment I made in the section
> > >>> about
> > >>> >>> how
> > >>> >>>>> to get the client and server to agree on credential lifetime:
> > >>> >>>>>
> > >>> >>>>> "We could add a new API call so that clients could ask servers
> > for
> > >>> the
> > >>> >>>>> lifetime they use, or we could extend the
> > >>> SaslHandshakeRequest/Response
> > >>> >>> API
> > >>> >>>>> call to include that information in the server's response – the
> > >>> client
> > >>> >>>>> would then adopt that value"
> > >>> >>>>>
> > >>> >>>>>
> > >>> >>>>> We set the config option value on the broker (with the "
> > >>> >>>>> listener.name.mechanism." prefix), and we will return the
> > >>> configured
> > >>> >>>>> value in the SaslHandshakeResponse (requiring a wire format
> > change
> > >>> in
> > >>> >>>>> addition to a version bump).  The value (referred to as "X"
> > below)
> > >>> can
> > >>> >>> be
> > >>> >>>>> negative, zero, or positive and is to be interpreted as
> follows:
> > >>> >>>>>
> > >>> >>>>> X = 0: Fully disable.  The server will respond to
> > >>> re-authentications,
> > >>> >>> but
> > >>> >>>>> it won't kill any connections due to expiration, and it won't
> > track
> > >>> >>> either
> > >>> >>>>> of the 2 metrics mentioned below.
> > >>> >>>>>
> > >>> >>>>> Now, a couple of definitions for when X != 0:
> > >>> >>>>>
> > >>> >>>>> 1) We define the *maximum allowed expiration time* to be the
> time
> > >>> >>>>> determined by the point when (re-)authentication occurs plus
> |X|
> > >>> >>>>> milliseconds.
> > >>> >>>>> 2) We define the *requested expiration time* to be the maximum
> > >>> allowed
> > >>> >>>>> expiration time except for the case of an OAuth bearer token,
> in
> > >>> which
> > >>> >>> case
> > >>> >>>>> it is the point at which the token expires.  (Kerberos
> presumably
> > >>> also
> > >>> >>>>> specifies a ticket lifetime, but frankly I am not in a position
> > to
> > >>> do
> > >>> >>> any
> > >>> >>>>> Kerberos-related coding in time for a 2.1.0 release and would
> > >>> prefer to
> > >>> >>>>> ignore this piece of information for this KIP -- would it be
> > >>> acceptable
> > >>> >>> to
> > >>> >>>>> have someone else add it later?).
> > >>> >>>>>
> > >>> >>>>> Based on these definitions, we define the behavior as follows:
> > >>> >>>>>
> > >>> >>>>> *X > 0: Fully enable*
> > >>> >>>>> A) The server will reject any authentication/re-authentication
> > >>> attempt
> > >>> >>>>> when the requested expiration time is after the maximum allowed
> > >>> >>> expiration
> > >>> >>>>> time (which could only happen with an OAuth bearer token,
> > assuming
> > >>> we
> > >>> >>> skip
> > >>> >>>>> dealing with Kerberos for now).
> > >>> >>>>> B) The server will kill connections that are used beyond the
> > >>> requested
> > >>> >>>>> expiration time.
> > >>> >>>>> C) A broker metric will be maintained that documents the number
> > >>> >>>>> connections killed by the broker.  This metric will be non-zero
> > if
> > >>> a
> > >>> >>> client
> > >>> >>>>> is connecting to the broker with re-authentication either
> > >>> unavailable
> > >>> >>> (i.e.
> > >>> >>>>> an older client) or disabled.
> > >>> >>>>>
> > >>> >>>>> *X < 0: Partially enable*
> > >>> >>>>> A) Same as above: the server will reject any
> > >>> >>>>> authentication/re-authentication attempt when the requested
> > >>> expiration
> > >>> >>> time
> > >>> >>>>> is after the maximum allowed expiration time (which could only
> > >>> happen
> > >>> >>> with
> > >>> >>>>> an OAuth bearer token, assuming we skip dealing with Kerberos
> for
> > >>> now).
> > >>> >>>>> B) The server will **not** kill connections that are used
> beyond
> > >>> the
> > >>> >>>>> requested expiration time.
> > >>> >>>>> C) A broker metric will be maintained that documents the number
> > of
> > >>> API
> > >>> >>>>> requests unrelated to re-authentication that are made over a
> > >>> connection
> > >>> >>>>> beyond the requested expiration time.  This metric will be
> > helpful
> > >>> for
> > >>> >>>>> operators to ensure that all clients are properly upgraded and
> > >>> >>>>> re-authenticating before fully enabling the server-side
> > >>> >>>>> expired-connection-kill functionality (i.e. by changing the
> value
> > >>> from a
> > >>> >>>>> negative number to a positive number): this metric will be zero
> > >>> across
> > >>> >>> all
> > >>> >>>>> brokers when it is safe to fully enable the server-side
> feature.
> > >>> >>>>>
> > >>> >>>>> Thoughts?
> > >>> >>>>>
> > >>> >>>>> Ron
> > >>> >>>>>
> > >>> >>>>> On Fri, Sep 14, 2018 at 11:59 AM Rajini Sivaram <
> > >>> >>> rajinisivaram@gmail.com>
> > >>> >>>>> wrote:
> > >>> >>>>>
> > >>> >>>>>> It will be good to shorten the config name. We have a config
> > >>> named `
> > >>> >>>>>> connections.max.idle.ms`. We could add something like `
> > >>> >>>>>> connections.max.expired.credentials.ms`. As you said, it
> would
> > be
> > >>> >>>>>> prefixed
> > >>> >>>>>> with listener and SASL mechanism name. We should be able to
> > >>> support
> > >>> >>>>>> connection termination in future even with SSL, so perhaps we
> > >>> don't
> > >>> >>> want
> > >>> >>>>>> the `sasl.server.` prefix (we just need to validate whether
> the
> > >>> config
> > >>> >>> is
> > >>> >>>>>> supported for the protocol). You can choose whether a boolean
> > >>> flag or a
> > >>> >>>>>> time interval makes more sense.
> > >>> >>>>>>
> > >>> >>>>>> For the client-side, the KIP explains how to make it work for
> > >>> other
> > >>> >>>>>> mechanisms and we can leave it that. Not convinced about
> > >>> server-side
> > >>> >>>>>> though. It seems to me that we probably would require API
> > changes
> > >>> to
> > >>> >>> make
> > >>> >>>>>> it work. Basically the proposed approach works only for
> > >>> OAUTHBEARER.
> > >>> >>>>>> Since
> > >>> >>>>>> we have to bump up SaslHandshakeRequest version in this KIP,
> it
> > >>> will be
> > >>> >>>>>> good to work out if we need to change the request or flow to
> > make
> > >>> this
> > >>> >>>>>> work
> > >>> >>>>>> for other mechanisms. I haven't figured out exactly what is
> > >>> needed, but
> > >>> >>>>>> will think about it and get back next week. In the meantime,
> you
> > >>> can
> > >>> >>> get
> > >>> >>>>>> the KIP up-to-date with the config, migration path etc. and
> get
> > it
> > >>> >>> ready
> > >>> >>>>>> to
> > >>> >>>>>> start vote next week.
> > >>> >>>>>>
> > >>> >>>>>>
> > >>> >>>>>>
> > >>> >>>>>>
> > >>> >>>>>>
> > >>> >>>>>> On Fri, Sep 14, 2018 at 1:24 PM, Ron Dagostino <
> > rndgstn@gmail.com
> > >>> >
> > >>> >>>>>> wrote:
> > >>> >>>>>>
> > >>> >>>>>>> Hi Rajini.
> > >>> >>>>>>>
> > >>> >>>>>>> Agreed, we will bump the SaslHandshakeRequest API number (no
> > wire
> > >>> >>>>>> format
> > >>> >>>>>>> change, of course).
> > >>> >>>>>>>
> > >>> >>>>>>> Agreed, responding to re-authentication will always be
> enabled
> > >>> on the
> > >>> >>>>>>> broker since it is initiated by the client.
> > >>> >>>>>>>
> > >>> >>>>>>> Agreed, connection termination by the broker will be opt-in.
> > I'm
> > >>> >>>>>> thinking
> > >>> >>>>>>> *sasl.server.disconnect.expired.credential.connections.
> > enable*,
> > >>> >>>>>> prefixed
> > >>> >>>>>>> with listener and SASL mechanism name in lower-case so it can
> > be
> > >>> >>>>>> opt-in at
> > >>> >>>>>>> the granularity of a SASL mechanism; for example, "
> > >>> >>>>>>> listener.name.sasl_ssl.oauthbearer.sasl.server.
> > >>> >>>>>>> disconnect.expired.credential.connections.enable=true
> > >>> >>>>>>> ".  It is long, so if you have a preference for a shorter
> name
> > >>> let me
> > >>> >>>>>> know
> > >>> >>>>>>> and we'll go with it instead.
> > >>> >>>>>>>
> > >>> >>>>>>> Agreed, additional metrics will be helpful.  I'll add them to
> > the
> > >>> >>> KIP.
> > >>> >>>>>>>
> > >>> >>>>>>> <<<need to work out exactly how this works with all SASL
> > >>> mechanisms
> > >>> >>>>>>> <<<not sure we have the data required on the broker or a way
> of
> > >>> >>>>>>> <<< extending the [GSSAPI] mechanism
> > >>> >>>>>>> Not sure I agree here.  The paragraph I added to the KIP
> > >>> describes
> > >>> >>> how
> > >>> >>>>>> I
> > >>> >>>>>>> think this can be done.  Given that description, do you still
> > >>> feel
> > >>> >>>>>> Kerberos
> > >>> >>>>>>> will not be possible?  If Kerberos presents a significant
> > hurdle
> > >>> >>> then I
> > >>> >>>>>>> don't think that should prevent us from doing it for other
> > >>> mechanisms
> > >>> >>>>>> -- I
> > >>> >>>>>>> would rather state that it isn't supported with GSSAPI due to
> > >>> >>>>>> limitations
> > >>> >>>>>>> in Kerberos itself than not have it for OAUTHBEARER, for
> > example.
> > >>> >>> And
> > >>> >>>>>>> right now I don't think we need more than a high-level
> > >>> description of
> > >>> >>>>>> how
> > >>> >>>>>>> this could be done.  In other words, I think we have this
> > >>> covered,
> > >>> >>>>>> unless
> > >>> >>>>>>> there is a fundamental problem due to Kerberos not making
> data
> > >>> (i.e.
> > >>> >>>>>> ticket
> > >>> >>>>>>> expiration) available on the server.   I want to submit this
> > KIP
> > >>> for
> > >>> >>> a
> > >>> >>>>>> vote
> > >>> >>>>>>> early next week in the hope of having it accepted by the
> > Monday,
> > >>> 9/24
> > >>> >>>>>>> deadline for the 2.1.0 release, and if that happens I
> believe I
> > >>> can
> > >>> >>>>>> get a
> > >>> >>>>>>> PR into really good shape soon thereafter (I'm working on it
> > >>> now).
> > >>> >>>>>>>
> > >>> >>>>>>> Ron
> > >>> >>>>>>>
> > >>> >>>>>>>
> > >>> >>>>>>>
> > >>> >>>>>>>
> > >>> >>>>>>>
> > >>> >>>>>>>
> > >>> >>>>>>> On Fri, Sep 14, 2018 at 5:57 AM Rajini Sivaram <
> > >>> >>>>>> rajinisivaram@gmail.com>
> > >>> >>>>>>> wrote:
> > >>> >>>>>>>
> > >>> >>>>>>>> Hi Ron,
> > >>> >>>>>>>>
> > >>> >>>>>>>> 1) Yes, we should update the version of
> > `SaslHandshakeRequest`.
> > >>> >>>>>> Clients
> > >>> >>>>>>>> would attempt to re-authenticate only if broker's
> > >>> >>>>>> SaslHandshakeRequest
> > >>> >>>>>>>> version >=2.
> > >>> >>>>>>>> 2) I think we should enable broker's re-authentication and
> > >>> >>> connection
> > >>> >>>>>>>> termination code regardless of client version.
> > Re-authentication
> > >>> >>>>>> could be
> > >>> >>>>>>>> always enabled since it is triggered only by clients and
> > broker
> > >>> >>> could
> > >>> >>>>>>> just
> > >>> >>>>>>>> handle it. Connection termination should be opt-in, but
> should
> > >>> work
> > >>> >>>>>> with
> > >>> >>>>>>>> clients of any version. If turned on and clients are of an
> > older
> > >>> >>>>>> version,
> > >>> >>>>>>>> this would simply force reconnection, which should be ok.
> > >>> >>> Additional
> > >>> >>>>>>>> metrics to monitor expired connections may be useful anyway.
> > >>> >>>>>>>>
> > >>> >>>>>>>> We also need to work out exactly how this works with all
> SASL
> > >>> >>>>>> mechanisms.
> > >>> >>>>>>>> In particular, we have ensure that we can make it work with
> > >>> >>> Kerberos
> > >>> >>>>>>> since
> > >>> >>>>>>>> we use the implementation from the JRE and I am not sure we
> > have
> > >>> >>> the
> > >>> >>>>>> data
> > >>> >>>>>>>> required on the broker or a way of extending the mechanism.
> > >>> >>>>>>>>
> > >>> >>>>>>>> Thoughts?
> > >>> >>>>>>>>
> > >>> >>>>>>>> On Thu, Sep 13, 2018 at 6:56 PM, Ron Dagostino <
> > >>> rndgstn@gmail.com>
> > >>> >>>>>>> wrote:
> > >>> >>>>>>>>
> > >>> >>>>>>>>> Hi Rajini.  I'm thinking about how we deal with migration.
> > >>>  For
> > >>> >>>>>>> example,
> > >>> >>>>>>>>> let's say we have an existing 2.0.0 cluster using
> > >>> >>> SASL/OAUTHBEARER
> > >>> >>>>>> and
> > >>> >>>>>>> we
> > >>> >>>>>>>>> want to use this feature.  The desired end state is to have
> > all
> > >>> >>>>>> clients
> > >>> >>>>>>>> and
> > >>> >>>>>>>>> brokers migrated to a version that supports the feature
> (call
> > >>> it
> > >>> >>>>>> 2.x)
> > >>> >>>>>>>> with
> > >>> >>>>>>>>> the feature turned on.  We need to document the supported
> > >>> path(s)
> > >>> >>>>>> to
> > >>> >>>>>>> this
> > >>> >>>>>>>>> end state.
> > >>> >>>>>>>>>
> > >>> >>>>>>>>> Here are a couple of scenarios with implications:
> > >>> >>>>>>>>>
> > >>> >>>>>>>>> 1) *Migrate client to 2.x and turn the feature on but
> broker
> > is
> > >>> >>>>>> still
> > >>> >>>>>>> at
> > >>> >>>>>>>>> 2.0.*  In this case the client is going to get an error
> when
> > it
> > >>> >>>>>> sends
> > >>> >>>>>>> the
> > >>> >>>>>>>>> SaslHandshakeRequest.  We want to avoid this.  It seems to
> me
> > >>> the
> > >>> >>>>>>> client
> > >>> >>>>>>>>> needs to know if the broker has been upgraded to the
> > necessary
> > >>> >>>>>> version
> > >>> >>>>>>>>> before trying to re-authenticate, which makes me believe we
> > >>> need
> > >>> >>> to
> > >>> >>>>>>> bump
> > >>> >>>>>>>>> the API version number in 2.x and the client has to check
> to
> > >>> see
> > >>> >>>>>> if the
> > >>> >>>>>>>> max
> > >>> >>>>>>>>> version the broker supports meets a minimum standard before
> > >>> >>> trying
> > >>> >>>>>> to
> > >>> >>>>>>>>> re-authenticate.  Do you agree?
> > >>> >>>>>>>>>
> > >>> >>>>>>>>> 2) *Migrate broker to 2.x and turn the feature on but
> client
> > is
> > >>> >>>>>> still
> > >>> >>>>>>> at
> > >>> >>>>>>>>> 2.0*.  In this case the broker is going to end up killing
> > >>> >>>>>> connections
> > >>> >>>>>>>>> periodically.   Again, we want to avoid this.  It's
> tempting
> > to
> > >>> >>> say
> > >>> >>>>>>>> "don't
> > >>> >>>>>>>>> do this" but I wonder if we can require installations to
> > >>> upgrade
> > >>> >>>>>> all
> > >>> >>>>>>>>> clients before turning the feature on at the brokers.
> > >>> Certainly
> > >>> >>>>>> we can
> > >>> >>>>>>>> say
> > >>> >>>>>>>>> "don't do this" for inter-broker clients -- in other words,
> > we
> > >>> >>> can
> > >>> >>>>>> say
> > >>> >>>>>>>> that
> > >>> >>>>>>>>> all brokers in a cluster should be upgraded before the
> > feature
> > >>> is
> > >>> >>>>>>> turned
> > >>> >>>>>>>> on
> > >>> >>>>>>>>> for any one of them -- but I don't know about our ability
> to
> > >>> say
> > >>> >>>>>> it for
> > >>> >>>>>>>>> non-broker clients.
> > >>> >>>>>>>>>
> > >>> >>>>>>>>> Now we consider the cases where both the brokers and the
> > >>> clients
> > >>> >>>>>> have
> > >>> >>>>>>>> been
> > >>> >>>>>>>>> upgraded to 2.x.  When and where should the feature be
> turned
> > >>> on?
> > >>> >>>>>> The
> > >>> >>>>>>>>> inter-broker case is interesting because I don't think
> think
> > we
> > >>> >>> can
> > >>> >>>>>>>> require
> > >>> >>>>>>>>> installations to restart every broker with a new config
> where
> > >>> the
> > >>> >>>>>>> feature
> > >>> >>>>>>>>> is turned on at the same time.  Do you agree that we cannot
> > >>> >>> require
> > >>> >>>>>>> this
> > >>> >>>>>>>>> and therefore must support installations restarting brokers
> > >>> with
> > >>> >>> a
> > >>> >>>>>> new
> > >>> >>>>>>>>> config at whatever pace they need -- which may be quite
> slow
> > >>> >>>>>> relative
> > >>> >>>>>>> to
> > >>> >>>>>>>>> token lifetimes?  Assuming you do agree, then there is
> going
> > to
> > >>> >>> be
> > >>> >>>>>> the
> > >>> >>>>>>>> case
> > >>> >>>>>>>>> where some brokers are going to have the feature turned on
> > and
> > >>> >>> some
> > >>> >>>>>>>> clients
> > >>> >>>>>>>>> (definitely inter-broker clients at a minimum) are going to
> > >>> have
> > >>> >>>>>> the
> > >>> >>>>>>>>> feature turned off.
> > >>> >>>>>>>>>
> > >>> >>>>>>>>> The above discussion assumes a single config on the broker
> > side
> > >>> >>>>>> that
> > >>> >>>>>>>> turns
> > >>> >>>>>>>>> on both the inter-broker client re-authentication feature
> as
> > >>> well
> > >>> >>>>>> as
> > >>> >>>>>>> the
> > >>> >>>>>>>>> server-side expired-connection-kill feature, but now I'm
> > >>> thinking
> > >>> >>>>>> we
> > >>> >>>>>>> need
> > >>> >>>>>>>>> the ability to turn these features on independently, plus
> > maybe
> > >>> >>> we
> > >>> >>>>>>> need a
> > >>> >>>>>>>>> way to monitor the number of "active, expired" connections
> to
> > >>> the
> > >>> >>>>>>> server
> > >>> >>>>>>>> so
> > >>> >>>>>>>>> that operators can be sure that all clients have been
> > >>> >>>>>> upgraded/enabled
> > >>> >>>>>>>>> before turning on the server-side expired-connection-kill
> > >>> >>> feature.
> > >>> >>>>>>>>>
> > >>> >>>>>>>>> So the migration plan would be as follows:
> > >>> >>>>>>>>>
> > >>> >>>>>>>>> 1) Upgrade all brokers to 2.x.
> > >>> >>>>>>>>> 2) After all brokers are upgraded, turn on
> re-authentication
> > >>> for
> > >>> >>>>>> all
> > >>> >>>>>>>>> brokers at whatever rate is desired -- just eventually, at
> > some
> > >>> >>>>>> point,
> > >>> >>>>>>>> get
> > >>> >>>>>>>>> the client-side feature turned on for all brokers so that
> > >>> >>>>>> inter-broker
> > >>> >>>>>>>>> connections are re-authenticating.
> > >>> >>>>>>>>> 3) In parallel with (1) and (2) above, upgrade clients to
> 2.x
> > >>> and
> > >>> >>>>>> turn
> > >>> >>>>>>>>> their re-authentication feature on.  Clients will check the
> > API
> > >>> >>>>>> version
> > >>> >>>>>>>> and
> > >>> >>>>>>>>> only re-authenticate to a broker that has also been
> upgraded
> > >>> >>> (note
> > >>> >>>>>> that
> > >>> >>>>>>>> the
> > >>> >>>>>>>>> ability of a broker to respond to a re-authentication
> cannot
> > be
> > >>> >>>>>> turned
> > >>> >>>>>>>> off
> > >>> >>>>>>>>> -- it is always on beginning with version 2.x, and it just
> > sits
> > >>> >>>>>> there
> > >>> >>>>>>>> doing
> > >>> >>>>>>>>> nothing if it isn't exercised by an enabled client)
> > >>> >>>>>>>>> 4) After (1), (2), and (3) are complete, check the 2.x
> broker
> > >>> >>>>>> metrics
> > >>> >>>>>>> to
> > >>> >>>>>>>>> confirm that there are no "active, expired" connections.
> > Once
> > >>> >>> you
> > >>> >>>>>> are
> > >>> >>>>>>>>> satisfied that (1), (2), and (3) are indeed complete you
> can
> > >>> >>>>>> enable the
> > >>> >>>>>>>>> server-side expired-connection-kill feature on each broker
> > via
> > >>> a
> > >>> >>>>>>> restart
> > >>> >>>>>>>> at
> > >>> >>>>>>>>> whatever pace is desired.
> > >>> >>>>>>>>>
> > >>> >>>>>>>>> Comments?
> > >>> >>>>>>>>>
> > >>> >>>>>>>>> Ron
> > >>> >>>>>>>>>
> > >>> >>>>>>>>>
> > >>> >>>>>>>>> On Wed, Sep 12, 2018 at 4:48 PM Ron Dagostino <
> > >>> rndgstn@gmail.com
> > >>> >>>>
> > >>> >>>>>>> wrote:
> > >>> >>>>>>>>>
> > >>> >>>>>>>>>> Ok, I am tempted to just say we go with the low-level
> > approach
> > >>> >>>>>> since
> > >>> >>>>>>> it
> > >>> >>>>>>>>> is
> > >>> >>>>>>>>>> the quickest and seems to meet the clear requirements. We
> > can
> > >>> >>>>>> always
> > >>> >>>>>>>>> adjust
> > >>> >>>>>>>>>> later if we get to clarity on other requirements or we
> > decide
> > >>> >>> we
> > >>> >>>>>> need
> > >>> >>>>>>>> to
> > >>> >>>>>>>>>> approach it differently for whatever reason.  But in the
> > >>> >>>>>> meantime,
> > >>> >>>>>>>> before
> > >>> >>>>>>>>>> fully committing to this decision, I would appreciate
> > another
> > >>> >>>>>>>> perspective
> > >>> >>>>>>>>>> if someone has one.
> > >>> >>>>>>>>>>
> > >>> >>>>>>>>>> Ron
> > >>> >>>>>>>>>>
> > >>> >>>>>>>>>> On Wed, Sep 12, 2018 at 3:15 PM Rajini Sivaram <
> > >>> >>>>>>>> rajinisivaram@gmail.com>
> > >>> >>>>>>>>>> wrote:
> > >>> >>>>>>>>>>
> > >>> >>>>>>>>>>> Hi Ron,
> > >>> >>>>>>>>>>>
> > >>> >>>>>>>>>>> Yes, I would leave out retries from this KIP for now. In
> > the
> > >>> >>>>>> future,
> > >>> >>>>>>>> if
> > >>> >>>>>>>>>>> there is a requirement for supporting retries, we can
> > >>> consider
> > >>> >>>>>> it. I
> > >>> >>>>>>>>> think
> > >>> >>>>>>>>>>> we can support retries with either approach if we needed
> > to,
> > >>> >>>>>> but it
> > >>> >>>>>>>>> would
> > >>> >>>>>>>>>>> be better to do it along with other changes required to
> > >>> >>> support
> > >>> >>>>>>>>>>> authentication servers that are not highly available.
> > >>> >>>>>>>>>>>
> > >>> >>>>>>>>>>> For maintainability, I am biased, so it will be good to
> get
> > >>> >>> the
> > >>> >>>>>>>>>>> perspective
> > >>> >>>>>>>>>>> of others in the community :-)
> > >>> >>>>>>>>>>>
> > >>> >>>>>>>>>>> On Wed, Sep 12, 2018 at 7:47 PM, Ron Dagostino <
> > >>> >>>>>> rndgstn@gmail.com>
> > >>> >>>>>>>>> wrote:
> > >>> >>>>>>>>>>>
> > >>> >>>>>>>>>>>> Hi Rajini.  Here is some feedback/some things I thought
> > of.
> > >>> >>>>>>>>>>>>
> > >>> >>>>>>>>>>>> First, I just realized that from a timing perspective
> > that I
> > >>> >>>>>> am
> > >>> >>>>>>> not
> > >>> >>>>>>>>> sure
> > >>> >>>>>>>>>>>> retry is going to be necessary.  The background login
> > >>> >>> refresh
> > >>> >>>>>>> thread
> > >>> >>>>>>>>>>>> triggers re-authentication when it refreshes the
> client's
> > >>> >>>>>>>> credential.
> > >>> >>>>>>>>>>> The
> > >>> >>>>>>>>>>>> OAuth infrastructure has to be available in order for
> this
> > >>> >>>>>> refresh
> > >>> >>>>>>>> to
> > >>> >>>>>>>>>>>> succeed (the background thread repeatedly retries if it
> > >>> >>> can't
> > >>> >>>>>>>> refresh
> > >>> >>>>>>>>>>> the
> > >>> >>>>>>>>>>>> credential, and that retry loop handles any temporary
> > >>> >>>>>> outage).  If
> > >>> >>>>>>>>>>> clients
> > >>> >>>>>>>>>>>> are told to re-authenticate when the credential is
> > refreshed
> > >>> >>>>>> **and
> > >>> >>>>>>>>> they
> > >>> >>>>>>>>>>>> actually re-authenticate at that point** then it is
> highly
> > >>> >>>>>>> unlikely
> > >>> >>>>>>>>> that
> > >>> >>>>>>>>>>>> the OAuth infrastructure would fail within those
> > intervening
> > >>> >>>>>>>>>>> milliseconds.
> > >>> >>>>>>>>>>>> So we don't need re-authentication retry in this KIP as
> > long
> > >>> >>>>>> as
> > >>> >>>>>>>> retry
> > >>> >>>>>>>>>>>> starts immediately.  The low-level prototype as
> currently
> > >>> >>>>>> coded
> > >>> >>>>>>>>> doesn't
> > >>> >>>>>>>>>>>> actually start re-authentication until the connection is
> > >>> >>>>>>>> subsequently
> > >>> >>>>>>>>>>> used,
> > >>> >>>>>>>>>>>> and that could take a while.  But then again, if the
> > >>> >>>>>> connection
> > >>> >>>>>>>> isn't
> > >>> >>>>>>>>>>> used
> > >>> >>>>>>>>>>>> for some period of time, then the lost value of having
> to
> > >>> >>>>>> abandon
> > >>> >>>>>>>> the
> > >>> >>>>>>>>>>>> connection is lessened anyway.  Plus, as you pointed
> out,
> > >>> >>>>>> OAuth
> > >>> >>>>>>>>>>>> infrastructure is assumed to be highly available.
> > >>> >>>>>>>>>>>>
> > >>> >>>>>>>>>>>> Does this makes sense, and would you be willing to say
> > that
> > >>> >>>>>> retry
> > >>> >>>>>>>>> isn't
> > >>> >>>>>>>>>>> a
> > >>> >>>>>>>>>>>> necessary requirement?  I'm tempted but would like to
> hear
> > >>> >>>>>> your
> > >>> >>>>>>>>>>> perspective
> > >>> >>>>>>>>>>>> first.
> > >>> >>>>>>>>>>>>
> > >>> >>>>>>>>>>>> Interleaving/latency and maintainability (more than
> lines
> > of
> > >>> >>>>>> code)
> > >>> >>>>>>>> are
> > >>> >>>>>>>>>>> the
> > >>> >>>>>>>>>>>> two remaining areas of comparison.  I did not add the
> > >>> >>>>>>>> maintainability
> > >>> >>>>>>>>>>> issue
> > >>> >>>>>>>>>>>> to the KIP yet, but before I do I thought I would
> address
> > it
> > >>> >>>>>> here
> > >>> >>>>>>>>> first
> > >>> >>>>>>>>>>> to
> > >>> >>>>>>>>>>>> see if we can come to consensus on it because I'm not
> > sure I
> > >>> >>>>>> see
> > >>> >>>>>>> the
> > >>> >>>>>>>>>>>> high-level approach as being hard to maintain (yet -- I
> > >>> >>> could
> > >>> >>>>>> be
> > >>> >>>>>>>>>>>> convinced/convince myself; we'll see).  I do want to
> make
> > >>> >>>>>> sure we
> > >>> >>>>>>>> are
> > >>> >>>>>>>>> on
> > >>> >>>>>>>>>>>> the same page about what is required to add
> > >>> >>> re-authentication
> > >>> >>>>>>>> support
> > >>> >>>>>>>>> in
> > >>> >>>>>>>>>>>> the high-level case.  Granted, the amount is zero in the
> > >>> >>>>>> low-level
> > >>> >>>>>>>>> case,
> > >>> >>>>>>>>>>>> and it doesn't get any better that that, but the amount
> in
> > >>> >>> the
> > >>> >>>>>>>>>>> high-level
> > >>> >>>>>>>>>>>> case is very low -- just a few lines of code.  For
> > example:
> > >>> >>>>>>>>>>>>
> > >>> >>>>>>>>>>>> KafkaAdminClient:
> > >>> >>>>>>>>>>>> https://github.com/apache/kafka/pull/5582/commits/
> > >>> >>> 4fa70f38b9
> > >>> >>>>>>>>>>>> d33428ff98b64a3a2bd668f5f28c38#diff-
> > >>> >>>>>>> 6869b8fccf6b098cbcb0676e8ceb26a7
> > >>> >>>>>>>>>>>> It is the same few lines of code for KafkaConsumer,
> > >>> >>>>>> KafkaProducer,
> > >>> >>>>>>>>>>>> WorkerGroupMember, and TransactionMarkerChannelManager
> > >>> >>>>>>>>>>>>
> > >>> >>>>>>>>>>>> The two synchronous I/O use cases are
> > >>> >>>>>> ControllerChannelManager and
> > >>> >>>>>>>>>>>> ReplicaFetcherBlockingSend (via ReplicaFetcherThread),
> and
> > >>> >>>>>> they
> > >>> >>>>>>>>> require
> > >>> >>>>>>>>>>> a
> > >>> >>>>>>>>>>>> little bit more -- but not much.
> > >>> >>>>>>>>>>>>
> > >>> >>>>>>>>>>>> Thoughts?
> > >>> >>>>>>>>>>>>
> > >>> >>>>>>>>>>>> Ron
> > >>> >>>>>>>>>>>>
> > >>> >>>>>>>>>>>> On Wed, Sep 12, 2018 at 1:57 PM Ron Dagostino <
> > >>> >>>>>> rndgstn@gmail.com>
> > >>> >>>>>>>>>>> wrote:
> > >>> >>>>>>>>>>>>
> > >>> >>>>>>>>>>>>> Thanks, Rajini.  Before I digest/respond to that,
> here's
> > >>> >>> an
> > >>> >>>>>>> update
> > >>> >>>>>>>>>>> that I
> > >>> >>>>>>>>>>>>> just completed.
> > >>> >>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>> I added a commit to the PR (
> > >>> >>>>>>>>>>> https://github.com/apache/kafka/pull/5582/)
> > >>> >>>>>>>>>>>>> that implements server-side kill of expired OAUTHBEARER
> > >>> >>>>>>>> connections.
> > >>> >>>>>>>>>>> No
> > >>> >>>>>>>>>>>>> tests yet since we still haven't settled on a final
> > >>> >>> approach
> > >>> >>>>>>>>>>> (low-level
> > >>> >>>>>>>>>>>> vs.
> > >>> >>>>>>>>>>>>> high-level).
> > >>> >>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>> I also updated the KIP to reflect the latest discussion
> > >>> >>> and
> > >>> >>>>>> PR
> > >>> >>>>>>> as
> > >>> >>>>>>>>>>>> follows:
> > >>> >>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>  - Include support for brokers killing connections as
> > >>> >>>>>> part of
> > >>> >>>>>>>> this
> > >>> >>>>>>>>>>> KIP
> > >>> >>>>>>>>>>>>>  (rather than deferring it to a future KIP as was
> > >>> >>>>>> originally
> > >>> >>>>>>>>>>>> mentioned; the
> > >>> >>>>>>>>>>>>>  PR now includes it as mentioned -- it was very easy to
> > >>> >>>>>> add)
> > >>> >>>>>>>>>>>>>  - Added metrics (they will mirror existing ones
> related
> > >>> >>>>>> to
> > >>> >>>>>>>>>>>>>  authentications; I have not added those to the PR)
> > >>> >>>>>>>>>>>>>  - Updated the implementation description to reflect
> the
> > >>> >>>>>>> current
> > >>> >>>>>>>>>>> state
> > >>> >>>>>>>>>>>>>  of the PR, which is a high-level, one-size-fits-all
> > >>> >>>>>> approach
> > >>> >>>>>>>> (as
> > >>> >>>>>>>>>>>> opposed to
> > >>> >>>>>>>>>>>>>  my initial, even-higher-level approach)
> > >>> >>>>>>>>>>>>>  - Added a "Rejected Alternative" for the first version
> > >>> >>>>>> of the
> > >>> >>>>>>>> PR,
> > >>> >>>>>>>>>>>>>  which injected requests directly into synchronous I/O
> > >>> >>>>>>> clients'
> > >>> >>>>>>>>>>> queues
> > >>> >>>>>>>>>>>>>  - Added a "Rejected Alternative" for the low-level
> > >>> >>>>>> approach
> > >>> >>>>>>> as
> > >>> >>>>>>>>>>>>>  suggested, but of course we have not formally decided
> > >>> >>> to
> > >>> >>>>>>> reject
> > >>> >>>>>>>>>>> this
> > >>> >>>>>>>>>>>>>  approach or adopt the current PR implementation.
> > >>> >>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>> I'll think about where we stand some more before
> > >>> >>> responding
> > >>> >>>>>>> again.
> > >>> >>>>>>>>>>>> Thanks
> > >>> >>>>>>>>>>>>> for the above reply.
> > >>> >>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>> Ron
> > >>> >>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>> On Wed, Sep 12, 2018 at 1:36 PM Rajini Sivaram <
> > >>> >>>>>>>>>>> rajinisivaram@gmail.com>
> > >>> >>>>>>>>>>>>> wrote:
> > >>> >>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>> Hi Ron,
> > >>> >>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>> Thank you for summarising, I think it covers the
> > >>> >>>>>> differences
> > >>> >>>>>>>>> between
> > >>> >>>>>>>>>>> the
> > >>> >>>>>>>>>>>>>> two approaches well.
> > >>> >>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>> A few minor points to answer the questions in there:
> > >>> >>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>> 1) When re-authetication is initiated in the Selector
> > >>> >>>>>> during
> > >>> >>>>>>>>> poll(),
> > >>> >>>>>>>>>>> we
> > >>> >>>>>>>>>>>>>> can
> > >>> >>>>>>>>>>>>>> move an idle channel to re-authentication state. It is
> > >>> >>>>>> similar
> > >>> >>>>>>> to
> > >>> >>>>>>>>>>>>>> injecting
> > >>> >>>>>>>>>>>>>> requests, but achieved by changing channel back to
> > >>> >>>>>>> authenticating
> > >>> >>>>>>>>>>> state.
> > >>> >>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>> 3) To clarify why I think re-authentication should fit
> > in
> > >>> >>>>>> with
> > >>> >>>>>>>> our
> > >>> >>>>>>>>>>>>>> authentication design: My point was not about a
> specific
> > >>> >>>>>>>> connection
> > >>> >>>>>>>>>>>> being
> > >>> >>>>>>>>>>>>>> usable or not usable. It was about what happens at the
> > >>> >>>>>> client
> > >>> >>>>>>> API
> > >>> >>>>>>>>>>> level.
> > >>> >>>>>>>>>>>>>> Our client API (producer/consumer/admin client etc.)
> > >>> >>>>>> currently
> > >>> >>>>>>>>> assume
> > >>> >>>>>>>>>>>> that
> > >>> >>>>>>>>>>>>>> a single broker authentication failure is a fatal
> error
> > >>> >>>>>> that is
> > >>> >>>>>>>>> never
> > >>> >>>>>>>>>>>>>> retried because we assume that broker only ever fails
> an
> > >>> >>>>>>>>>>> authentication
> > >>> >>>>>>>>>>>>>> request if credentials are invalid. If we ever decide
> to
> > >>> >>>>>>> support
> > >>> >>>>>>>>>>> cases
> > >>> >>>>>>>>>>>>>> where broker occasionally fails an authentication
> > request
> > >>> >>>>>> due
> > >>> >>>>>>> to
> > >>> >>>>>>>> a
> > >>> >>>>>>>>>>>>>> transient failure, we need to do more around how we
> > >>> >>> handle
> > >>> >>>>>>>>>>>> authentication
> > >>> >>>>>>>>>>>>>> failures in clients. We may decide that it is ok to
> > close
> > >>> >>>>>> the
> > >>> >>>>>>>>>>> connection
> > >>> >>>>>>>>>>>>>> for authentication and not for re-authentication as
> you
> > >>> >>>>>>>> mentioned,
> > >>> >>>>>>>>>>> but
> > >>> >>>>>>>>>>>> we
> > >>> >>>>>>>>>>>>>> need to change the way this disconnection is handled
> by
> > >>> >>>>>>> clients.
> > >>> >>>>>>>> So
> > >>> >>>>>>>>>>> IMO,
> > >>> >>>>>>>>>>>>>> we
> > >>> >>>>>>>>>>>>>> should either add support for transient retriable
> > >>> >>>>>>> authentication
> > >>> >>>>>>>>>>>> failures
> > >>> >>>>>>>>>>>>>> properly or not retry for any scenario. Personally, I
> > >>> >>> don't
> > >>> >>>>>>> think
> > >>> >>>>>>>>> we
> > >>> >>>>>>>>>>>> would
> > >>> >>>>>>>>>>>>>> want to retry all authentication failures even if it
> is
> > a
> > >>> >>>>>>>>>>>>>> re-authentication, I think we could (at some point in
> > >>> >>>>>> future),
> > >>> >>>>>>>>> allow
> > >>> >>>>>>>>>>>>>> brokers to return an error code that indicates that it
> > >>> >>> is a
> > >>> >>>>>>>>> transient
> > >>> >>>>>>>>>>>>>> broker-side failure rather than invalid credentials
> and
> > >>> >>>>>> handle
> > >>> >>>>>>>> the
> > >>> >>>>>>>>>>> error
> > >>> >>>>>>>>>>>>>> differently. I see no reason at that point why we
> > >>> >>> wouldn't
> > >>> >>>>>>> handle
> > >>> >>>>>>>>>>>>>> authentication and re-authentication in the same way.
> > >>> >>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>> 4) As you said, the high-level approach would be
> bigger
> > >>> >>>>>> than
> > >>> >>>>>>> the
> > >>> >>>>>>>>>>>> low-level
> > >>> >>>>>>>>>>>>>> approach in terms of LOC. But I wouldn't be too
> > concerned
> > >>> >>>>>> about
> > >>> >>>>>>>>>>> lines of
> > >>> >>>>>>>>>>>>>> code. My bigger concern was about modularity. Our
> > >>> >>> security
> > >>> >>>>>> code
> > >>> >>>>>>>> is
> > >>> >>>>>>>>>>>> already
> > >>> >>>>>>>>>>>>>> complex, protocols like Kerberos and SSL that we use
> > from
> > >>> >>>>>> the
> > >>> >>>>>>> JRE
> > >>> >>>>>>>>>>> make
> > >>> >>>>>>>>>>>>>> problem diagnosis hard. Async I/O makes the networking
> > >>> >>> code
> > >>> >>>>>>>>> complex.
> > >>> >>>>>>>>>>> You
> > >>> >>>>>>>>>>>>>> need to understand networking layer to work with the
> > >>> >>>>>> security
> > >>> >>>>>>>>> layer,
> > >>> >>>>>>>>>>> but
> > >>> >>>>>>>>>>>>>> the rest of the code base doesn't rely on knowledge of
> > >>> >>>>>>>>>>> network/security
> > >>> >>>>>>>>>>>>>> layers. My main concern about the high-level approach
> is
> > >>> >>>>>> that
> > >>> >>>>>>> it
> > >>> >>>>>>>>>>> spans
> > >>> >>>>>>>>>>>>>> these boundaries, making it harder to maintain in the
> > >>> >>> long
> > >>> >>>>>> run.
> > >>> >>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>> On Wed, Sep 12, 2018 at 10:23 AM, Stanislav Kozlovski
> <
> > >>> >>>>>>>>>>>>>> stanislav@confluent.io> wrote:
> > >>> >>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>>> Hi Ron, Rajini
> > >>> >>>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>>> Thanks for summarizing the discussion so far, Ron!
> > >>> >>>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>>> 1) How often do we have such long-lived connection
> > >>> >>>>>> idleness
> > >>> >>>>>>>> (e.g
> > >>> >>>>>>>>>>> 5-10
> > >>> >>>>>>>>>>>>>>> minutes) in practice?
> > >>> >>>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>>> 3) I agree that retries for re-authentication are
> > >>> >>> useful.
> > >>> >>>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>>> 4) The interleaving of requests sounds like a great
> > >>> >>>>>> feature
> > >>> >>>>>>> to
> > >>> >>>>>>>>>>> have,
> > >>> >>>>>>>>>>>> but
> > >>> >>>>>>>>>>>>>>> the tradeoff against code complexity is a tough one.
> I
> > >>> >>>>>> would
> > >>> >>>>>>>>>>>> personally
> > >>> >>>>>>>>>>>>>> go
> > >>> >>>>>>>>>>>>>>> with the simpler approach since you could always add
> > >>> >>>>>>>> interleaving
> > >>> >>>>>>>>>>> on
> > >>> >>>>>>>>>>>>>> top if
> > >>> >>>>>>>>>>>>>>> the community decides the latency should be better.
> > >>> >>>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>>> Best,
> > >>> >>>>>>>>>>>>>>> Stanislav
> > >>> >>>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>>> On Tue, Sep 11, 2018 at 5:00 AM Ron Dagostino <
> > >>> >>>>>>>> rndgstn@gmail.com
> > >>> >>>>>>>>>>
> > >>> >>>>>>>>>>>>>> wrote:
> > >>> >>>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>>>> Hi everyone.  I've updated the PR to reflect the
> > >>> >>> latest
> > >>> >>>>>>>>>>> conclusions
> > >>> >>>>>>>>>>>>>> from
> > >>> >>>>>>>>>>>>>>>> this ongoing discussion.  The KIP still needs the
> > >>> >>>>>> suggested
> > >>> >>>>>>>>>>>> updates; I
> > >>> >>>>>>>>>>>>>>> will
> > >>> >>>>>>>>>>>>>>>> do that later this week.  II agree with Rajini that
> > >>> >>>>>> some
> > >>> >>>>>>>>>>> additional
> > >>> >>>>>>>>>>>>>>>> feedback from the community at large would be very
> > >>> >>>>>> helpful
> > >>> >>>>>>> at
> > >>> >>>>>>>>>>> this
> > >>> >>>>>>>>>>>>>> point
> > >>> >>>>>>>>>>>>>>> in
> > >>> >>>>>>>>>>>>>>>> time.
> > >>> >>>>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>>>> Here's where we stand.
> > >>> >>>>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>>>> We have 2 separate implementations for
> > >>> >>>>>> re-authenticating
> > >>> >>>>>>>>> existing
> > >>> >>>>>>>>>>>>>>>> connections -- a so-called "low-level" approach and
> a
> > >>> >>>>>>>>> (relatively
> > >>> >>>>>>>>>>>>>>> speaking)
> > >>> >>>>>>>>>>>>>>>> "high-level" approach -- and we agree on how they
> > >>> >>>>>> should be
> > >>> >>>>>>>>>>>> compared.
> > >>> >>>>>>>>>>>>>>>> Specifically, Rajini has provided a mechanism that
> > >>> >>>>>> works
> > >>> >>>>>>> at a
> > >>> >>>>>>>>>>>>>> relatively
> > >>> >>>>>>>>>>>>>>>> low level in the stack by intercepting network sends
> > >>> >>>>>> and
> > >>> >>>>>>>>> queueing
> > >>> >>>>>>>>>>>>>> them up
> > >>> >>>>>>>>>>>>>>>> while re-authentication happens; then the queued
> > >>> >>> sends
> > >>> >>>>>> are
> > >>> >>>>>>>> sent
> > >>> >>>>>>>>>>>> after
> > >>> >>>>>>>>>>>>>>>> re-authentication succeeds, or the connection is
> > >>> >>>>>> closed if
> > >>> >>>>>>>>>>>>>>>> re-authentication fails.  See the prototype commit
> at
> > >>> >>>>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>>>> https://github.com/rajinisivaram/kafka/commit/
> > >>> >>>>>>> b9d711907ad843
> > >>> >>>>>>>>>>>>>>> c11d17e80d6743bfb1d4e3f3fd
> > >>> >>>>>>>>>>>>>>>> .
> > >>> >>>>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>>>> I have an implementation that works higher up in the
> > >>> >>>>>> stack;
> > >>> >>>>>>>> it
> > >>> >>>>>>>>>>>> injects
> > >>> >>>>>>>>>>>>>>>> authentication requests into the KafkaClient via a
> > >>> >>> new
> > >>> >>>>>>> method
> > >>> >>>>>>>>>>> added
> > >>> >>>>>>>>>>>> to
> > >>> >>>>>>>>>>>>>>> that
> > >>> >>>>>>>>>>>>>>>> interface, and the implementation (i.e.
> > >>> >>> NetworkClient)
> > >>> >>>>>>> sends
> > >>> >>>>>>>>> them
> > >>> >>>>>>>>>>>> when
> > >>> >>>>>>>>>>>>>>>> poll() is called.  The updated PR is available at
> > >>> >>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/5582/.
> > >>> >>>>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>>>> Here is how we think these two approaches should be
> > >>> >>>>>>> compared.
> > >>> >>>>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>>>> 1) *When re-authentication starts*.  The low-level
> > >>> >>>>>> approach
> > >>> >>>>>>>>>>>> initiates
> > >>> >>>>>>>>>>>>>>>> re-authentication only if/when the connection is
> > >>> >>>>>> actually
> > >>> >>>>>>>> used,
> > >>> >>>>>>>>>>> so
> > >>> >>>>>>>>>>>> it
> > >>> >>>>>>>>>>>>>> may
> > >>> >>>>>>>>>>>>>>>> start after the existing credential expires; the
> > >>> >>>>>> current PR
> > >>> >>>>>>>>>>>>>>> implementation
> > >>> >>>>>>>>>>>>>>>> injects re-authentication requests into the existing
> > >>> >>>>>> flow,
> > >>> >>>>>>>> and
> > >>> >>>>>>>>>>>>>>>> re-authentication starts immediately regardless of
> > >>> >>>>>> whether
> > >>> >>>>>>> or
> > >>> >>>>>>>>> not
> > >>> >>>>>>>>>>>> the
> > >>> >>>>>>>>>>>>>>>> connection is being used for something else.  Rajini
> > >>> >>>>>>> believes
> > >>> >>>>>>>>> the
> > >>> >>>>>>>>>>>>>>> low-level
> > >>> >>>>>>>>>>>>>>>> approach can be adjusted to re-authenticate idle
> > >>> >>>>>>> connections
> > >>> >>>>>>>>>>>> (Rajini,
> > >>> >>>>>>>>>>>>>> I
> > >>> >>>>>>>>>>>>>>>> don't see how this can be done without injecting
> > >>> >>>>>> requests,
> > >>> >>>>>>>>> which
> > >>> >>>>>>>>>>> is
> > >>> >>>>>>>>>>>>>> what
> > >>> >>>>>>>>>>>>>>>> the high-level connection is doing, but I am
> probably
> > >>> >>>>>>> missing
> > >>> >>>>>>>>>>>>>> something
> > >>> >>>>>>>>>>>>>>> --
> > >>> >>>>>>>>>>>>>>>> no need to go into it unless it is easy to explain.)
> > >>> >>>>>>>>>>>>>>>>
> > >>> >>>>>>>>>>>>>>>> 2) *When requests not related to re-authentication
> > >>> >>> are
> > >>> >>>>>> able
> > >>> >>>>>>>> to
> > >>> >>>>>>>>>>> use
> > >>> >>>>>>>>>>>> the
> > >>> >>>>>>>>>>>>>>>> re-authenticating connection*.  The low-level
> > >>> >>> approach
> > >>> >>>>>>>> finishes
> > >>> >>>>>>>>>>>>>>>> re-authentication completely before allowing
> anything
> > >>> >>>>>> else
> > >>> >>>>>>> to
> > >>> >>>>>>>>>>>> traverse
> > >>> >>>>>>>>>>>>>>> the
> > >>> >>>>>>>>>>>>>>>> connection, and we agree that this is how the
> > >>> >>> low-level
> > >>> >>>>>>>>>>>> implementation
> > >>> >>>>>>>>>>>>>>> must
> > >>> >>>>>>>>>>>>>>>> work without significant work/changes.  The current
> > >>> >>> PR
> > >>> >>>>>>>>>>>> implementation
> > >>> >>>>>>>>>>>>>>>> interleaves re-authentication requests with the
> > >>> >>>>>> existing
> > >>> >>>>>>> flow
> > >>> >>>>>>>>> in
> > >>> >>>>>>>>>>> the
> > >>> >>>>>>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message