kafka-dev mailing list archives

From Chris Egerton <chr...@confluent.io>
Subject Re: [DISCUSS] KIP-507: Securing Internal Connect REST Endpoints
Date Tue, 17 Sep 2019 17:36:56 GMT
Hi Randall,

I've added your suggested paragraph to the KIP; it definitely clarifies the
intended behavior more than the content that it replaced. Thanks!

As far as I can tell, the two items remaining open for discussion are the
names for the new configs (which should be made less REST-request-specific)
and whether separate configs should be provided for the key generation and
MAC signing algorithms. As soon as I've gotten some research done on the
latter, I'll report back and then we can hopefully also begin discussing the
former.
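
In the meantime, here's a quick standalone sketch of the check I have in mind,
using only standard javax.crypto APIs (the class and method names are just for
this sketch, not anything in the KIP or the PR):

```java
import javax.crypto.KeyGenerator;
import javax.crypto.Mac;
import javax.crypto.SecretKey;

public class KeyMacCompatCheck {
    // Returns true if a key generated for keyAlg can initialize and run a Mac of macAlg.
    static boolean compatible(String keyAlg, String macAlg) {
        try {
            SecretKey key = KeyGenerator.getInstance(keyAlg).generateKey();
            Mac mac = Mac.getInstance(macAlg);
            mac.init(key);                           // throws InvalidKeyException on a mismatch
            mac.doFinal("task-configs".getBytes());  // sign an arbitrary payload
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String keyAlg : new String[] {"HmacSHA256", "HmacMD5", "DES", "AES"})
            for (String macAlg : new String[] {"HmacSHA256", "HmacMD5"})
                System.out.println(keyAlg + " key + " + macAlg + " mac -> "
                        + (compatible(keyAlg, macAlg) ? "OK" : "rejected"));
    }
}
```

On my JVM the HMAC implementations appear to treat the key as raw bytes, which
is why the mixed combinations haven't failed so far; whether that holds across
providers is exactly what still needs verifying.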

Cheers,

Chris
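
P.S. For anyone following the thread without the KIP open: the mechanism under
discussion boils down to ordinary HMAC signing of the request body with the
shared session key. A minimal illustrative sketch (the names and key material
here are made up for illustration and are not from the KIP):

```java
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

public class RequestSigner {
    // Compute the base-64 HMAC signature a follower would attach to its request.
    static String sign(SecretKey sessionKey, String macAlg, byte[] requestBody) throws Exception {
        Mac mac = Mac.getInstance(macAlg);
        mac.init(sessionKey);
        return Base64.getEncoder().encodeToString(mac.doFinal(requestBody));
    }

    // The leader recomputes the signature over the body it received and compares
    // in constant time; a mismatch would surface to the follower as an HTTP 403.
    static boolean verify(SecretKey sessionKey, String macAlg, byte[] requestBody,
                          String signature) throws Exception {
        Mac mac = Mac.getInstance(macAlg);
        mac.init(sessionKey);
        return MessageDigest.isEqual(mac.doFinal(requestBody),
                                     Base64.getDecoder().decode(signature));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the session key the leader distributes via the config topic.
        SecretKey key = new SecretKeySpec("demo-session-key".getBytes(), "HmacSHA256");
        byte[] body = "[{\"task\": 0}]".getBytes();
        String sig = sign(key, "HmacSHA256", body);
        System.out.println(verify(key, "HmacSHA256", body, sig));               // genuine request
        System.out.println(verify(key, "HmacSHA256", "forged".getBytes(), sig)); // tampered body
    }
}
```

Using MessageDigest.isEqual for the comparison avoids timing leaks; in the
actual worker the signature and algorithm would travel in request headers.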

On Mon, Sep 16, 2019 at 4:28 PM Randall Hauch <rhauch@gmail.com> wrote:

> On Mon, Sep 16, 2019 at 3:06 PM Chris Egerton <chrise@confluent.io> wrote:
>
> > Hi Randall,
> >
> > The new default value for the key size configuration will be "null". I've
> > clarified this in the KIP. This will still preserve backwards compatibility
> > and should not be an issue.
> >
>
> Thanks for clarifying this in the KIP.
>
>
> >
> > I understand your point about key generation vs MAC signing algorithms;
> > like I said, I'll need to do more research.
> >
> > I respectfully disagree that a single algorithm list would be easier to
> > understand as opposed to a list of accepted algorithms and a signature
> > algorithm. Placing special priority on the first element in that list is
> > fairly implicit and leaves room for confusion where users might have the
> > same algorithms in their lists for that config but in different orders for
> > different workers. The group coordination protocol selection behavior isn't
> > really a great example since users don't configure that directly
> > themselves.
> >
> > RE the proposed set of new configs: like I stated in my previous response,
> > I object to the use of "cluster." as a configuration prefix for any worker
> > configs: "most configurations deal with the worker and, transitively, the
> > cluster; there doesn't seem to be good enough cause for including that
> > prefix for these configurations." I also think this discussion should
> > probably continue a little more before we start proposing new
> > configuration names, given that it's still undecided which configurations
> > we're going to expose.
> >
> > > I don't think we have any data about how often a follower will be fully
> > > caught up, and it's possible that a worker's consumer fails to keep the
> > > worker caught up quickly enough with the configs topic and the new
> > > session key. So can we really say that a follower making a request with
> > > an expired key will be rare?
> > It would depend on the key rotation interval, but I can't imagine the
> > likelihood of this occurring with an interval as low as even 10 minutes
> > would be that high. The config topic is low-volume; the consumer for that
> > topic isn't going to be flooded with writes and it seems fine to expect
> > fairly low latency for the consumer of this topic.
> >
>
> My concern is that we're really *assuming* it's not a problem. All I'm
> asking for is more clarity on what happens when a worker doesn't know about
> the new session key when it makes a request to this REST resource. The KIP
> makes it clear that the request will be retried and that the existing error
> message will be replaced with a debug message, at least for the time being.
> Perhaps the KIP paragraph that talks about this can be reworded to make
> this clearer, something akin to:
>
> "The leader will only accept requests signed with the most current key.
> However, Connect follower workers may routinely experience small delays
> when reading the new key. Rather than always logging such task
> configuration failure and retry attempts as errors (the current behavior),
> Connect's distributed herder will be modified slightly to handle such HTTP
> 403 responses for this task configuration request by quietly retrying them
> with the latest key for up to 1 minute. If failures persist for more than 1
> minute, they will be logged as errors."
>
>
> > > Can we not retry for longer than 1 second if the request fails with HTTP
> > > 403? I'm concerned what the UX will be if/when this happens, and that the
> > > user sees a very obtuse and seemingly unrelated error message they won't
> > > know how to fix.
> > To be clear, the KIP doesn't propose any changes to the infinite retry
> > logic that's present in the worker today. All that the KIP proposes is
> > that an existing error-level message be replaced with a debug-level
> > message if it's suspected that a request has failed due to an out-of-date
> > key. With that out of the way, sure, we can bump the grace period before
> > beginning to emit error-level log messages. I think going as high as a
> > minute might be acceptable; we should try to stay fairly low, however, in
> > case the request failures are due to some other reason that should be
> > surfaced as soon as possible and with some urgency.
> >
>
> Ack. See above.
>
> >
> > > The text in the "Failure to relay task configurations to leader due to
> > > incorrect configuration" section is similarly ambiguous. How will a user
> > > know that this is occurring, and is this really an unlikely situation
> > > (e.g., "This scenario is unlikely to occur with any normal usage of
> > > Connect.")? Seems like users unintentionally misconfiguring some of
> > > their Connect workers is quite common.
> > A user will know that this is occurring based on error-level log messages
> > emitted by the worker about a failure to relay task configurations
> > ("Failed to reconfigure connector's tasks, retrying after backoff:"), plus
> > a stack trace. Yes, this situation is unlikely; most worker files will
> > never contain the newly-proposed configurations for this KIP since the
> > default values should suffice for most cases. If anyone tries to adjust
> > these values, we will have documentation available on what their behavior
> > is and what to do to ensure their cluster doesn't break while changing
> > them (including the rolling upgrade procedure outlined in the KIP). These
> > configurations make it no likelier that a user will accidentally break
> > their Connect cluster than the group ID, key/value converter, REST
> > extension, and broker authentication configs (to name just a few), and
> > since they will be left out of any sample worker configurations included
> > in Kafka, the odds of someone accidentally messing with them are low
> > enough that it doesn't seem worth investing a lot of effort into making it
> > harder for someone to shoot themselves in the foot.
> > I'll update the KIP to include possible indicators that the cluster has
> > been misconfigured, but I don't think this scenario deserves a ton of
> > priority.
> >
> > > Maybe provide a bit more description of what these error messages will
> > > be.
> > I believe this is more of an implementation detail and should be left to
> > PR review. KIPs should only be expected to be so fine-grained; proposing
> > actual log messages doesn't seem necessary for the overall
> > adoption/rejection of the mechanisms proposed here, and the iteration
> > cycle on GitHub is significantly faster and more efficient than on a
> > mailing list.
> >
> > > Do we need a JMX metric that shows the protocol that each worker is
> > > configured to use, and whether the workers are using session keys? This
> > > would be a great way to monitor the cluster's use of this feature.
> > I think exposing the connect protocol in use would be useful, sure. You
> > could deduce whether workers are using session keys based on this, so
> > there'd only be a need to expose one additional metric.
> >
>
> Thanks!
>
> Randall
>
>
> >
> > Thanks again for the feedback!
> >
> > Cheers,
> >
> > Chris
> >
> > On Mon, Sep 16, 2019 at 11:58 AM Randall Hauch <rhauch@gmail.com> wrote:
> >
> > > Thanks, Chris. I still have a number of questions and requests for
> > > clarification.
> > >
> > > If we don't provide a default value for the key size, then the
> > > following statement in the "Backwards compatibility" subsection is no
> > > longer true: "All of the proposed configurations here have default
> > > values, making them backwards compatible." IIUC, a user will not be able
> > > to upgrade an existing Connect cluster without editing their
> > > configurations, and this pretty much means it is not backwards
> > > compatible, which is a non-starter.
> > >
> > > Regarding the new configurations and our earlier discussion about
> > > whether all are required, I'm not convinced that we need separate
> > > configs for signing and key generation algorithms. If this is a common
> > > need, then the KIP should justify that with an explanation. But as it
> > > stands now, exposing multiple algorithm properties means the UX is more
> > > complicated and we can't make things simpler in the future. I also think
> > > that a single algorithm list where the first is used for signing would
> > > be easier to understand (fewer is better) and would be more in line with
> > > the way rebalance protocols are chosen by the broker (see
> > > https://www.youtube.com/watch?v=MmLezWRI3Ys for a decent explanation).
> > > If at some point in the future we do want that extra flexibility, then
> > > we can expose additional properties.
> > >
> > > I also asked earlier about renaming the config properties so they can
> > > be used in other places within the cluster other than the task configs
> > > request, which is something I think we'll want to do. If we minimize the
> > > number of configs, then how about `cluster.session.algorithms`,
> > > `cluster.session.key.size` and `cluster.session.key.ttl.ms`?
> > >
> > > The "Proposed Changes" section now mentions:
> > >
> > > > The leader will only accept requests signed with the most current key.
> > > > This should not cause any major problems; if a follower attempts to
> > > > make a request with an expired key (which should be quite rare and only
> > > > occur if the request is made by a follower that is not fully caught up
> > > > to the end of the config topic), the initial request will fail, but
> > > > will be subsequently retried after a backoff period. This backoff
> > > > period should leave sufficient room for the rebalance to complete.
> > > >
> > >
> > > I don't think we have any data about how often a follower will be fully
> > > caught up, and it's possible that a worker's consumer fails to keep the
> > > worker caught up quickly enough with the configs topic and the new
> > > session key. So can we really say that a follower making a request with
> > > an expired key will be rare?
> > >
> > > > If the first four requests fail with HTTP 403, it will be assumed that
> > > > this is due to an out-of-date session key; a debug-level message about
> > > > the subsequent retry will be logged in place of the current error-level
> > > > log message of "Failed to reconfigure connector's tasks, retrying after
> > > > backoff: " followed by a stack trace. Since the backoff period is 250
> > > > milliseconds, this should give at least one second of leeway for an
> > > > outdated key to be updated. If longer than that is required, the usual
> > > > error-level log messages will begin to be generated by the worker.
> > > >
> > >
> > > Can we not retry for longer than 1 second if the request fails with HTTP
> > > 403? I'm concerned what the UX will be if/when this happens, and that
> > > the user sees a very obtuse and seemingly unrelated error message they
> > > won't know how to fix.
> > >
> > > The text in the "Failure to relay task configurations to leader due to
> > > incorrect configuration" section is similarly ambiguous. How will a
> > > user know that this is occurring, and is this really an unlikely
> > > situation (e.g., "This scenario is unlikely to occur with any normal
> > > usage of Connect.")? Seems like users unintentionally misconfiguring
> > > some of their Connect workers is quite common.
> > >
> > > > Additionally, it will be vital to design appropriate error messages
> > > > for this scenario so that users can (hopefully) dig themselves out of
> > > > that hole on their own.
> > >
> > >
> > > Maybe provide a bit more description of what these error messages will
> > > be.
> > >
> > > Do we need a JMX metric that shows the protocol that each worker is
> > > configured to use, and whether the workers are using session keys? This
> > > would be a great way to monitor the cluster's use of this feature.
> > >
> > > Best regards,
> > >
> > > Randall
> > >
> > >
> > > On Wed, Sep 11, 2019 at 7:00 PM Chris Egerton <chrise@confluent.io>
> > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I've updated KIP-507 to reflect the changes inspired by Randall's
> > > > recent feedback. In addition, after some further research, I've
> > > > decided to remove the proposed default value for the
> > > > internal.request.key.size and instead, should no value be provided,
> > > > rely on the default key size for the given key algorithm. More detail
> > > > can be found in the KIP if anyone's interested.
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Tue, Sep 10, 2019 at 3:18 PM Chris Egerton <chrise@confluent.io>
> > > wrote:
> > > >
> > > > > Hi Randall,
> > > > >
> > > > > Thanks so much for your comprehensive feedback! To avoid creating
> an
> > > > > extremely long response I'll address your comments/questions by
> > > > referencing
> > > > > the identifiers you've provided for them as opposed to copying them
> > and
> > > > > responding inline; hopefully things are still readable :)
> > > > >
> > > > >
> > > > > RE new configs:
> > > > >
> > > > > a) I believe exposing these configurations right now is the correct
> > > > > approach. There are two scenarios that we should aim to support:
> > > running
> > > > > Connect on a bare-bones JVM that doesn't implement anything beyond
> > the
> > > > > requirements of the JVM specs in terms of key generation and key
> > > signing
> > > > > algorithms, and running Connect in an environment with hard
> security
> > > > > requirements for, e.g., FIPS compliance. The default values for
> these
> > > > > configurations are intended to satisfy the first use case; however,
> > in
> > > > > order to enable users to conform with higher security standards
> (the
> > > > second
> > > > > use case), some key generation and key signing algorithms that are
> > not
> > > > > guaranteed to be present on every implementation of the JVM may be
> > > > > required. I don't see a way to satisfy both use cases without
> adding
> > > > > configurability.
> > > > > With regards to key size: yes, this needs to be configurable as
> there
> > > are
> > > > > different ranges of acceptable key size for different key
> generation
> > > > > algorithms; for example, the "AES" algorithm for key generation
> > > requires
> > > > a
> > > > > key size of either 128, 192 or 256 (at least on my local JVM)
> whereas
> > > the
> > > > > "DES" algorithm requires a key size of exactly 56.
> > > > > As far as enabling different algorithms for key generation vs.
> > request
> > > > > signing goes... I'm not sure. Local tests involving keys generated
> > with
> > > > > different algorithms from the mac algorithms used to sign inputs
> > (e.g.,
> > > > > combining an HmacSHA256 key with an HmacMD5 mac or using a DES key
> > with
> > > > an
> > > > > HmacSHA256 mac) haven't thrown any exceptions yet but this is a
> > little
> > > > > beyond the boundaries of my current knowledge, so I'll have to get
> > back
> > > > to
> > > > > you on that point. At the very least, your question (and the
> research
> > > > I've
> > > > > done so far to try to address it) has highlighted a bug in my
> current
> > > PR
> > > > > that'll definitely need to be fixed :)
> > > > >
> > > > > b) The riskiest scenario is if a follower worker is configured to
> > use a
> > > > > request signing algorithm that isn't allowed by the leader. In this
> > > case,
> > > > > any failure will only occur if/when that follower starts up a
> > connector
> > > > and
> > > > > then has to forward tasks for that connector to the leader, which
> may
> > > not
> > > > > happen immediately. Once that failure occurs, an endless failure
> loop
> > > > will
> > > > > occur wherein the follower endlessly retries to send those task
> > > > > configurations to the leader and pauses by the backoff interval in
> > > > between
> > > > > each failed request.
> > > > > There are two ways to rectify this situation; either shut down the
> > > > > follower and restart it after editing its configuration to use a
> > > request
> > > > > signing algorithm permitted by the leader, or shut down all other
> > > workers
> > > > > in the cluster that do not permit the request signing algorithm
> used
> > by
> > > > the
> > > > > follower, reconfigure them to permit it, and then restart them.
> > > > > This scenario is unlikely to occur with any normal usage of
> Connect,
> > > but
> > > > > the circumstances leading to it will need to be called out very
> > > carefully
> > > > > in order to avoid putting the cluster into a bad state and flooding
> > > logs
> > > > > with scary-looking error messages. Speaking of, it'll be vital to
> > > design
> > > > > appropriate error messages for this scenario so that users can
> > > > (hopefully)
> > > > > dig themselves out of that hole on their own.
> > > > > Differing values for any of the other configs shouldn't actually be
> > too
> > > > > problematic, given that the three remaining configs all deal with
> key
> > > > > generation/rotation, which is only handled by one worker at a time
> > (the
> > > > > leader).
> > > > >
> > > > > c) Good call. Yes, these configs are all "low" importance and I'll
> > > update
> > > > > the KIP accordingly to reflect that.
> > > > >
> > > > > d) No, there is no window when multiple keys are valid. This is
> > > > explicitly
> > > > > called out in the KIP at the end of the "Proposed Changes" section:
> > > > > > The leader will only accept requests signed with the most current
> > > key.
> > > > > This should not cause any major problems; if a follower attempts to
> > > make
> > > > a
> > > > > request with an expired key (which should be quite rare and only
> > occur
> > > if
> > > > > the request is made by a follower that is not fully caught up to
> the
> > > end
> > > > of
> > > > > the config topic), the initial request will fail, but will be
> > > > subsequently
> > > > > retried after a backoff period.
> > > > >
> > > > > e) I'll update the KIP to reflect the fact that, barring any need
> for
> > > > > heightened levels of security compliance, none of these
> > configurations
> > > > > should need to be altered and the defaults should suffice.
> > > > >
> > > > > f) As mentioned above, a larger key size isn't necessarily possible
> > for
> > > > > all key generation algorithms. I believe these defaults are the
> best
> > we
> > > > can
> > > > > work with until/unless other algorithms are added to the JVM spec.
> > > > >
> > > > > g) I don't believe there's a lot of precedent for configuration
> > > > properties
> > > > > like this in AK. The closest I could find was the
> > "password.encoder.*"
> > > > > broker properties, but those deal with storing and encrypting
> > > passwords,
> > > > as
> > > > > opposed to generating keys and signing requests with them. I'm
> > leaning
> > > > > towards the stance that it's not worth calling out this lack of
> > > precedent
> > > > > in the KIP itself but if you or others feel differently I'll go
> ahead
> > > and
> > > > > add a small section on it.
> > > > >
> > > > > h) These configs were (are) definitely targeted toward validating
> > > > internal
> > > > > REST traffic, although in theory I suppose they could be used to
> > > validate
> > > > > any inter-worker communication that can be reduced to a byte array
> > for
> > > > > signing. I think that the mechanism proposed here should be limited
> > to
> > > > use
> > > > > for internal communications only, so the "internal." prefix still
> > seems
> > > > > appropriate. Using "cluster." in conjunction with that
> > > > > ("internal.cluster.key.generation.algorithm", for example), doesn't
> > > feel
> > > > > quite right as it might raise the question of what an "internal
> > > > > cluster" is. Reversing the order
> > > > > ("cluster.internal.key.generation.algorithm") seems redundant; most
> > > > > configurations deal with the worker and, transitively, the cluster;
> > > > > there doesn't seem to be good enough cause for including that prefix
> > > > > for these configurations.
> > > > > Two alternatives I can think of are "internal.communication." and
> > > > > "inter.worker.", but I'm open to suggestions. I'll leave the
> configs
> > > > as-are
> > > > > pending further discussion; these things tend to change rapidly as
> > new
> > > > > minds join the conversation.
> > > > >
> > > > > i) Yeah, some startup analysis to make sure that the list of
> > permitted
> > > > > verification algorithms includes the signature algorithm is
> probably
> > > > > warranted. However, an approach of taking the first permitted
> > algorithm
> > > > and
> > > > > using that for verification doesn't really buy us much in terms of
> > > > > preventing misconfiguration errors from occurring; the biggest
> > problem
> > > > > would be if these configurations weren't aligned across workers
> > (i.e.,
> > > if
> > > > > the signature algorithm on a follower wasn't included in the
> leader's
> > > > list
> > > > > of permitted algorithms) and that still wouldn't be caught at
> > startup.
> > > > > I'd advocate for keeping the two configurations separate, carefully
> > > > > documenting them, and enabling a startup check that forces at least
> > > that
> > > > > worker to be consistent with its own configs, but I think morphing
> > > those
> > > > > two configs into one is a little too implicit and unintuitive to be
> > > worth
> > > > > it.
> > > > >
> > > > >
> > > > > RE error log messages caused by not-yet-updated workers reading
> > session
> > > > > keys from the config topic: in the most common upgrade scenario,
> this
> > > > will
> > > > > never happen. No session key will be distributed by the leader
> until
> > > > > internal request verification is enabled, which will only happen
> once
> > > all
> > > > > workers have indicated that they support the new "sessioned"
> > > subprotocol
> > > > > during group coordination. The only circumstance that would lead to
> > > that
> > > > > kind of error message would be if a new worker joins a cluster that
> > has
> > > > > already been fully upgraded to use the new "sessioned" subprotocol,
> > but
> > > > for
> > > > > some reason this new worker is running an older version of the
> > Connect
> > > > > framework. In this case, the error message is probably warranted;
> > it's
> > > > > likely that something's going wrong here.
> > > > >
> > > > > RE potential error log messages due to stale keys: Mmm, good point.
> > My
> > > > > first instinct for rectifying this would be to differentiate the
> > > various
> > > > > causes of rejection for a request to this endpoint via HTTP status;
> > 400
> > > > > (bad request) for requests that either lack the required signature
> > and
> > > > > signature algorithm headers, specify an invalid
> > (non-base-64-decodable)
> > > > > signature, or specify a signature algorithm that isn't permitted by
> > the
> > > > > leader, and 403 (forbidden) for requests that contain well-formed
> > > values
> > > > > for the signature and signature algorithm headers, but which fail
> > > request
> > > > > verification. A follower can catch that 403 and, instead of
> logging a
> > > > > scary-looking ERROR level message, retry a limited number of times
> > with
> > > > > something INFO level before raising the stakes and making the error
> > > > message
> > > > > look scarier. I'll add this to the KIP but if you or anyone has
> other
> > > > > thoughts I'm happy to hear them.
> > > > >
> > > > > RE advertising the availability/current usage of internal request
> > > > signing:
> > > > > I think the primary mechanism that should be used to communicate
> this
> > > to
> > > > > the user is the worker log. Theoretically, you can just run `curl
> -H
> > > > > 'Content-Type: application/json' -d '[]' http://<connect
> > > > > worker>/anything/tasks` and if the feature is enabled, you'll get a
> > 400
> > > > and
> > > > > if it's disabled you'll either get a 404 or a 200 (if a connector
> > with
> > > > that
> > > > > name actually exists). However, nobody wants to do that :) So
> > instead,
> > > I
> > > > > think the following log messages would be appropriate for the
> > following
> > > > > scenarios:
> > > > > 1) A worker is started up with connect.protocol set to something
> > other
> > > > > than "sessioned" - a WARN level log message is emitted explaining
> > that
> > > > the
> > > > > Connect cluster will be vulnerable because internal request signing
> > is
> > > > > disabled, and providing information on how to enable request
> signing
> > if
> > > > > desired.
> > > > > 2) A worker with connect.protocol set to "sessioned" (or, in the
> > > > > future, any protocol that enables internal request signing) joins
> > > > > the cluster group, and it is observed that the subprotocol used by
> > > > > the cluster does not indicate support for internal request signing -
> > > > > a WARN level log message is emitted explaining that even though this
> > > > > worker supports internal request signing, one or more workers in the
> > > > > cluster do not support this behavior and therefore it will be
> > > > > disabled; information should also be provided on how to identify
> > > > > which workers these are and how to enable request signing with them
> > > > > if desired.
> > > > > I don't think an error is appropriate to log in either of these
> > cases,
> > > > > since it's possible that an intentional downgrade might be
> necessary.
> > > > I'll
> > > > > add this information (including proposed log message scenarios) to
> > the
> > > > KIP.
> > > > >
> > > > > RE upgrade UX: I think I've covered the reasoning behind exposing
> > these
> > > > > configurations even though the defaults are also reasonable. I can
> > call
> > > > out
> > > > > the automatic enabling of request signature verification and
> > highlight
> > > > the
> > > > > (hopefully seamless) standard upgrade path of a single rolling
> > upgrade
> > > in
> > > > > the KIP.
> > > > >
> > > > > RE rejected alternatives: Yep, good point. Will add to the KIP.
> > > > >
> > > > >
> > > > > Thanks again for all the feedback! Looking forward to hearing more.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Chris
> > > > >
> > > > > On Mon, Sep 9, 2019 at 4:49 PM Randall Hauch <rhauch@gmail.com>
> > wrote:
> > > > >
> > > > >> Thanks for putting this KIP together, Chris. It's thorough and
> well
> > > > >> thought
> > > > >> out, and you've done a great job responding to comments. It is
> > indeed
> > > > >> going
> > > > >> to be nice to harden the REST API a bit more.
> > > > >>
> > > > >> I do have a few questions/concerns/comments, all of which I think
> > can
> > > be
> > > > >> incorporated relatively easily.
> > > > >>
> > > > >> First, regarding the new Connect worker configs mentioned in the
> > > "Public
> > > > >> Interfaces" section:
> > > > >> a. Do we need to expose all of these, or could Connect just
> > > > >> internally use constants and only expose them in the future if
> > > > >> necessary? Do users really need to explicitly set any of these, or
> > > > >> will most users simply use the defaults? Do we need to specify a
> > > > >> different algorithm for generating a session key versus signing a
> > > > >> request? Does the key size need to be configured, or can we use a
> > > > >> large enough value and grow this in the future with patch releases?
> > > > >> b. What happens if one worker has different values for any of
> > > > >> these new worker configs than the other workers? If there is an
> > > > >> error, when will that error occur (before startup happens, only
> > > > >> when that worker attempts to use the internal REST endpoint, etc.)
> > > > >> and will the error be clear enough that the user knows how to fix
> > > > >> the problem?
> > > > >> c. The section should mention the importance of each new
> > configuration
> > > > >> property. I suspect these will be "low" importance, since the
> > defaults
> > > > are
> > > > >> likely sufficient for most users.
> > > > >> d. Is there a window when both the new session key and the prior
> > > session
> > > > >> key are still valid? If so, what is that window?
> > > > >> e. Can the KIP emphasize more that the default values are likely
> to
> > be
> > > > >> sufficient for most users?
> > > > >> f. Are all of the defaults sufficient for the foreseeable future?
> > For
> > > > >> example, what is the cost of using a larger session key size?
> > > > >> g. Please mention whether these config properties follow or do not
> > > > follow
> > > > >> existing precedent within AK.
> > > > >> h. Might these settings be used to validate other kinds of
> > > intra-cluster
> > > > >> communication and messages? If so, the "internal.request" prefix
> > might
> > > > >> make
> > > > >> that difficult. Was there any thought about using a bit more
> generic
> > > > >> prefix, such as "cluster."?
> > > > >> i. The `internal.request.verification.algorithms` and
> > > > >> `internal.request.signature.algorithm` are dependent upon each
> > other,
> > > > and
> > > > >> it seems like it's possible for the "verification" algorithms to
> not
> > > > >> include the "signature" algorithm. We could catch cases like that
> > > > (though
> > > > >> not with simple Validator implementations since they are specific
> > to a
> > > > >> single property value), but would it be better to have a single
> list
> > > and
> > > > >> to
> > > > >> use the first item in the list as the signature algorithm?
> > > > >>
> > > > >> Second, the KIP proposes to write the session key to the existing
> > > config
> > > > >> topic using record(s) with a new key. Konstantine was correct when
> > he
> > > > said
> > > > >> earlier in the thread:
> > > > >>
> > > > >>
> > > > >> > Indeed, broadcasting the session key to the followers is
> > necessary.
> > > > But
> > > > >> > adding it to the configs topic with a new key is compatible with
> > > > >> previous
> > > > >> > versions (the key will be ignored by workers that don't support
> > this
> > > > >> > protocol) and works since all workers will need to read this
> topic
> > > to
> > > > >> the
> > > > >> > end to remain in the group.
> > > > >> >
> > > > >>
> > > > >> This is true that the KafkaConfigBackingStore will not act upon
> > config
> > > > >> record keys with keys that do not match the known pattern, but
> such
> > > > >> messages do result in an ERROR log message. I'm concerned that
> > > operators
> > > > >> will be alarmed (both emotionally and technically) if they begin
> to
> > > > >> upgrade
> > > > >> a Connect cluster and start seeing ERROR log messages. We could
> > change
> > > > >> these error messages to be warnings, but that will take effect
> only
> > > > after
> > > > >> a
> > > > >> worker is upgraded and restarted, and will not affect the
> > > > >> yet-to-be-upgraded workers in the Connect cluster. What can we do
> to
> > > > avoid
> > > > >> these ERROR log messages? Should the default be to not change
> > > > >> `connect.protocol`, allow the user to upgrade all clusters, and
> then
> > > to
> > > > >> change `connect.protocol=sessioned` with a (second) rolling
> upgrade?
> > > Or,
> > > > >> if
> > > > >> we decide to rely upon an upgrade recipe to avoid these, then we
> > > should
> > > > >> document that recipe here.
> > > > >>
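To make the concern concrete, here is a rough Python sketch of the kind of key dispatch that produces the ERROR message in question. The key formats and the "session-key" record key are invented for illustration and do not reflect the actual KafkaConfigBackingStore code:

```python
import logging
import re

# Hypothetical config-topic key patterns; the real store uses its own
# formats, these are stand-ins for illustration only.
KNOWN_PATTERNS = {
    "connector": re.compile(r"^connector-(.+)$"),
    "task": re.compile(r"^task-(.+)-(\d+)$"),
    "session_key": re.compile(r"^session-key$"),
}

def classify_config_key(key):
    for kind, pattern in KNOWN_PATTERNS.items():
        if pattern.match(key):
            return kind
    # An older worker that does not know the session-key pattern would
    # land here, producing the ERROR log message discussed above.
    logging.error("Ignoring unknown config topic key: %s", key)
    return None
```

An upgraded worker recognizes the new record key; a yet-to-be-upgraded worker falls through to the error branch, which is exactly why the rolling-upgrade experience matters here.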
> > > > >> Third, I think the KIP should address the potential for seeing one
> > or
> > > > more
> > > > >> "Failed to reconfigure connector's tasks, retrying after backoff"
> > > error
> > > > >> message during a key rotation. It would be good to eliminate the
> > > > >> condition,
> > > > >> maybe by returning a response indicating authorization failure and
> > > > >> signifying a key rotation is required, and to prevent an ERROR log
> > > > >> message.
> > > > >> I don't think separate INFO level log messages saying to disregard
> > > > earlier
> > > > >> ERROR log messages is sufficient.
> > > > >>
> > > > >> Fourth, how will an operator of a Connect cluster know whether
> this
> > > > >> internal endpoint is protected by authorization via this feature?
> > And
> > > > how
> > > > >> can an operator know which Connect workers are preventing a
> cluster
> > > from
> > > > >> enabling this feature? Should a warning or error be logged if this
> > > > feature
> > > > >> is disabled after being enabled?
> > > > >>
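One possible answer to the last question, sketched in Python: log a warning whenever a completed rebalance drops the cluster out of the sessioned protocol. The version constant and the callback are hypothetical, not part of the KIP:

```python
import logging

SESSIONED_VERSION = 2  # assumed value of CONNECT_PROTOCOL_V2

def on_rebalance_complete(negotiated_version, previously_sessioned):
    # Returns whether request signing is active after this rebalance,
    # warning the operator if the feature was just silently disabled.
    sessioned = negotiated_version >= SESSIONED_VERSION
    if previously_sessioned and not sessioned:
        logging.warning(
            "Internal request signing was enabled but the cluster has "
            "downgraded to protocol version %d; internal REST endpoints "
            "are no longer protected", negotiated_version)
    return sessioned
```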
> > > > >> Finally, I have a few nits:
> > > > >>
> > > > >>    1. The "Backwards compatibility" section should be more focused
> > on
> > > > the
> > > > >>    upgrade UX. "All of the new worker configurations have sensible
> > > > >> defaults,
> > > > >>    and most users can simply upgrade without needing to override
> > > them."
> > > > >> (BTW,
> > > > >>    if this is true, then IMO that adds weight to only exposing a
> > > minimum
> > > > >>    number of new worker configs.)
> > > > >>    2. The rejected alternatives should include the earlier
> approach
> > of
> > > > >>    sharing the session keys over the Connect subprotocol, with
> pros
> > > and
> > > > >> cons.
> > > > >>
> > > > >> Overall, very nice work, Chris.
> > > > >>
> > > > >> Best regards,
> > > > >>
> > > > >> Randall
> > > > >>
> > > > >> On Fri, Sep 6, 2019 at 4:49 PM Chris Egerton <chrise@confluent.io
> >
> > > > wrote:
> > > > >>
> > > > >> > Hi all,
> > > > >> >
> > > > >> > I've published a draft PR implementing the changes currently
> > > proposed
> > > > in
> > > > >> > KIP-507, which can be found at
> > > > >> https://github.com/apache/kafka/pull/7310.
> > > > >> > Thanks for all of the review and feedback so far! Given the lack
> > of
> > > > any
> > > > >> > major voiced reservations so far, if I don't hear anything else
> > over
> > > > the
> > > > >> > course of the next few days or so I'll be opening this for a
> vote.
> > > > >> >
> > > > >> > Cheers,
> > > > >> >
> > > > >> > Chris
> > > > >> >
> > > > >> >
> > > > >> > On Wed, Aug 28, 2019 at 12:21 PM Chris Egerton <
> > chrise@confluent.io
> > > >
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Hi all,
> > > > >> > >
> > > > >> > > Wow, thanks for all the feedback! Happy to see this proposal
> > > getting
> > > > >> some
> > > > >> > > love :)
> > > > >> > >
> > > > >> > >
> > > > >> > > RE Konstantine's comments:
> > > > >> > >
> > > > >> > > > I've read the discussions regarding why the rebalancing
> > protocol
> > > > is
> > > > >> > used
> > > > >> > > here and your intention to follow the approach which was
> > recently
> > > > >> used in
> > > > >> > > order to elegantly support upgrades without requiring rolling
> > > > restarts
> > > > >> > > makes sense.
> > > > >> > > > However, I'd like to revisit the proposed extension of the
> > > connect
> > > > >> > > protocol
> > > > >> > > going a little bit more in detail.
> > > > >> > > Indeed, broadcasting the session key to the followers is
> > > necessary.
> > > > >> But
> > > > >> > > adding it to the configs topic with a new key is compatible
> with
> > > > >> previous
> > > > >> > > versions (the key will be ignored by workers that don't
> support
> > > this
> > > > >> > > protocol) and works since all workers will need to read this
> > topic
> > > > to
> > > > >> the
> > > > >> > > end to remain in the group.
> > > > >> > > > Additionally, to your point about consensus, meaning how the
> > > > workers
> > > > >> > all
> > > > >> > > know during the same generation that the "sessioned" version
> of
> > > the
> > > > >> > > protocol was chosen by the leader, it seems to me that an
> > explicit
> > > > >> > upgrade
> > > > >> > > of the protocol on the wire, on its name and on the metadata
> > that
> > > > are
> > > > >> > sent
> > > > >> > > with the join request by each worker might not be required.
> > What I
> > > > >> > believe
> > > > >> > > would suffice is merely an increment of the version of the
> > > protocol,
> > > > >> > > because all the other information remains the same (for
> example,
> > > the
> > > > >> list
> > > > >> > > of assigned and revoked tasks, etc). This would allow us not
> to
> > > > extend
> > > > >> > the
> > > > >> > > metadata even more with yet another JoinGroupRequest and could
> > > > achieve
> > > > >> > what
> > > > >> > > you want in terms of an orchestrated upgrade.
> > > > >> > >
> > > > >> > > I really like this idea, thanks for suggesting it. A simple
> bump
> > > of
> > > > >> the
> > > > >> > > protocol version does seem sufficient to achieve consensus
> among
> > > > >> workers
> > > > >> > > and would significantly reduce the implementation complexity
> of
> > an
> > > > >> > entirely
> > > > >> > > separate protocol just for the sake of distributing keys, and
> > > using
> > > > >> the
> > > > >> > > config topic to relay keys instead of the group coordination
> > > > protocol
> > > > >> > would
> > > > >> > > prevent rebalances solely for the sake of key rotation
> (instead,
> > > the
> > > > >> > leader
> > > > >> > > can just periodically write a new key to the config topic).
> I'll
> > > > >> update
> > > > >> > the
> > > > >> > > KIP by EOD today with these changes.
> > > > >> > >
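A minimal sketch of what that leader-side periodic rotation could look like; the interval, key size, and config-topic record key below are all assumptions for illustration, not values from the KIP:

```python
import os
import time

KEY_SIZE_BYTES = 32           # e.g. sized for HmacSHA256; assumed
ROTATION_INTERVAL_SEC = 3600  # hypothetical rotation period

class SessionKeyRotator:
    """Leader-side helper: periodically write a fresh key to the config
    topic instead of forcing a rebalance just to distribute it."""

    def __init__(self, write_to_config_topic):
        self._write = write_to_config_topic
        self._last_rotation = 0.0

    def maybe_rotate(self, now=None):
        now = time.time() if now is None else now
        if now - self._last_rotation >= ROTATION_INTERVAL_SEC:
            self._write("session-key", os.urandom(KEY_SIZE_BYTES))
            self._last_rotation = now
            return True
        return False
```

Because all workers already read the config topic to the end to stay in the group, a new key record reaches every follower without any change in assignments.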
> > > > >> > > > All the leader worker would have to check is whether all the
> > > > >> > assignments
> > > > >> > > that it took were in "sessioned" version (possibly
> > > > >> CONNECT_PROTOCOL_V2=2
> > > > >> > if
> > > > >> > > you'd choose to go this route).
> > > > >> > >
> > > > >> > > I like this idea for dictating whether the leader requires
> > request
> > > > >> > > signing, SGTM. I think we can have all workers use the latest
> > > > session
> > > > >> key
> > > > >> > > present in the config topic (if there is one) to sign their
> > > > requests;
> > > > >> > since
> > > > >> > > the signatures are passed via header, this shouldn't cause any
> > > > >> problems
> > > > >> > if
> > > > >> > > a follower signs a request that gets sent to a leader that
> isn't
> > > > >> > performing
> > > > >> > > request validation.
> > > > >> > >
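As a sketch of the signing scheme under discussion: the follower signs the request body with the latest session key and passes the signature via header, and a leader that is performing validation recomputes and compares. The header name and digest choice here are assumptions, not the KIP's final values:

```python
import hashlib
import hmac

SIGNATURE_HEADER = "X-Connect-Request-Signature"  # hypothetical name

def sign(session_key: bytes, body: bytes) -> str:
    # Follower side: compute an HMAC over the request body.
    return hmac.new(session_key, body, hashlib.sha256).hexdigest()

def verify(session_key: bytes, body: bytes, headers: dict) -> bool:
    # Leader side: constant-time comparison against the header value.
    expected = sign(session_key, body)
    provided = headers.get(SIGNATURE_HEADER, "")
    return hmac.compare_digest(expected, provided)
```

A leader that is not validating simply ignores the extra header, which is why followers can sign unconditionally during a mixed-version rollout.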
> > > > >> > >
> > > > >> > > RE Magesh's comments:
> > > > >> > >
> > > > >> > > > Thanks a lot for the KIP. This will certainly be a useful
> > > > feature. I
> > > > >> > > would
> > > > >> > > have preferred to use the topic approach as well but I also
> > > > understand
> > > > >> > your
> > > > >> > > point of view about the operational complexity for upgrades.
> If
> > > not
> > > > >> with
> > > > >> > > this KIP, I would certainly want to go that route at some
> point
> > in
> > > > the
> > > > >> > > future.
> > > > >> > >
> > > > >> > > I'm actually a little conflicted on this front. If this KIP
> > passes
> > > > >> and we
> > > > >> > > have a secure way to perform quick, synchronous,
> > > follower-to-leader
> > > > >> > > communication, it may be better to hold onto it, and perhaps
> > > > >> > > even expand on it, in case this functionality comes in handy
> > > > >> > > later. Additionally, there would
> > > > >> > > have to be a lot of thought put into the semantics of using
> > > > >> topic-based
> > > > >> > > communication for relaying task configs to the leader due to
> its
> > > > >> > > asynchronous nature; I'm not convinced it's not worth it, but
> > I'm
> > > > also
> > > > >> > not
> > > > >> > > entirely convinced it is.
> > > > >> > >
> > > > >> > > > As far as using the rebalance protocol goes, it would be
> great
> > > if
> > > > >> you
> > > > >> > > could
> > > > >> > > elaborate on what exactly would be the rebalance impact when a
> > key
> > > > >> > expires.
> > > > >> > > I see that you have called out saying that there should be no
> > > > >> significant
> > > > >> > > impact but it will be great to explicitly state as what is to
> be
> > > > >> > expected.
> > > > >> > > I would prefer to not have any sorts of rebalancing when this
> > > > happens
> > > > >> > since
> > > > >> > > the connector and task assignments should not change with this
> > > > event.
> > > > >> It
> > > > >> > > will be useful to explain this a bit more.
> > > > >> > >
> > > > >> > > Given Konstantine's comments, this question isn't strictly
> > > relevant
> > > > >> > > anymore (unless we decide to return to the approach of
> > > distributing
> > > > >> > session
> > > > >> > > keys during rebalance), but just for completeness it does seem
> > > worth
> > > > >> > > addressing. The "rebalances" proposed for key rotation would
> be
> > > > >> achieved
> > > > >> > by
> > > > >> > > a request from the leader to rejoin the group. This would
> > > > immediately
> > > > >> > > result in the leader being asked to assign connectors and
> tasks
> > to
> > > > the
> > > > >> > > group, which would exactly match the existing assignments, so
> > with
> > > > >> that
> > > > >> > and
> > > > >> > > the help of the new incremental cooperative rebalance
> semantics
> > > > there
> > > > >> > would
> > > > >> > > be zero downtime for connectors or tasks. Truly, "rebalance"
> is
> > a
> > > > bit
> > > > >> of
> > > > >> > a
> > > > >> > > misnomer here; we're really talking about performing group
> > > > assignment
> > > > >> > > (which doesn't necessarily imply adding or removing
> > > connectors/tasks
> > > > >> > > to/from any workers), but as far as I've seen the two are
> > > basically
> > > > >> used
> > > > >> > > synonymously w/r/t Connect so I elected to use the more
> common,
> > > > albeit
> > > > >> > > slightly less accurate, term.
> > > > >> > >
> > > > >> > >
> > > > >> > > RE Greg's comments:
> > > > >> > >
> > > > >> > > > Does this roll-out ratchet the protocol version
> irreversibly?
> > > > >> > >
> > > > >> > > Nope, the rollout will be achieved by bumping the protocol
> (or,
> > > > after
> > > > >> > > adjusting for Konstantine's suggestion, just the protocol
> > version)
> > > > for
> > > > >> > each
> > > > >> > > worker, but the Kafka group coordinator will only begin using
> > that
> > > > new
> > > > >> > > protocol if every worker indicates that that protocol is
> > supported
> > > > >> when
> > > > >> > it
> > > > >> > > joins the group. So, if a cluster existed with workers that
> all
> > > > >> supported
> > > > >> > > protocols 1 and 2 and indicated a preference for protocol 2,
> the
> > > > >> cluster
> > > > >> > > would use protocol 2. However, if a new worker that only
> > supported
> > > > >> > protocol
> > > > >> > > 1 joined the cluster, the entire cluster would be bumped back
> > down
> > > > to
> > > > >> > > protocol 1. If/when that worker leaves or indicates it
> supports
> > > > >> protocol
> > > > >> > 2,
> > > > >> > > the cluster would bump back up to protocol 2.
> > > > >> > >
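The selection behavior described above can be sketched as the coordinator picking the highest protocol version that every member advertises:

```python
# Sketch of group-wide protocol selection: one member that only
# supports an older version drags the whole group down to it, and the
# group bumps back up once that member leaves or upgrades.

def select_protocol(members):
    """members: list of sets of protocol versions each worker supports."""
    common = set.intersection(*members)
    return max(common) if common else None

assert select_protocol([{1, 2}, {1, 2}, {1, 2}]) == 2  # all upgraded
assert select_protocol([{1, 2}, {1, 2}, {1}]) == 1     # one old worker
```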
> > > > >> > > >What is the expected behavior when the feature has been
> > enabled,
> > > > and
> > > > >> a
> > > > >> > > worker joins the
> > > > >> > > group while advertising an old protocol?
> > > > >> > >
> > > > >> > > I partially answered this above; the entire cluster would
> > > downgrade
> > > > >> its
> > > > >> > > protocol version. The practical effect of this would be that
> > > request
> > > > >> > > signature verification by the leader would immediately cease.
> > > > >> > >
> > > > >> > > > I think it's reasonable to prevent single-worker downgrades,
> > > since
> > > > >> this
> > > > >> > > is an attack vector.
> > > > >> > > Actually, this is fine. Groups are a protected resource in
> > Kafka;
> > > if
> > > > >> you
> > > > >> > > want to ensure this doesn't happen, just secure your Kafka
> > broker
> > > > and
> > > > >> use
> > > > >> > > something like ACLs to make sure only authorized entities can
> > join
> > > > >> your
> > > > >> > > group.
> > > > >> > >
> > > > >> > > > But what happens if every worker in a cluster goes down, and
> > at
> > > > >> least
> > > > >> > > the first worker comes
> > > > >> > > back with an old protocol? This is a potential attack that
> > > requires
> > > > >> > access
> > > > >> > > to the workers, but could also be used by customers to
> > > intentionally
> > > > >> > > downgrade their cluster.
> > > > >> > >
> > > > >> > > This kind of attack actually doesn't require access to any of
> > the
> > > > >> Connect
> > > > >> > > workers; group coordination is managed by the Kafka cluster.
> So,
> > > > even
> > > > >> if
> > > > >> > > your entire Connect cluster crashes and then a malicious user
> > > tries
> > > > to
> > > > >> > join
> > > > >> > > the group, as long as your Kafka broker has restricted access
> to
> > > the
> > > > >> > group
> > > > >> > > used by your Connect cluster, that attack would still fail.
> > > > >> > >
> > > > >> > > > You mentioned that this security improvement only adds
> > security
> > > > when
> > > > >> > > there
> > > > >> > > are a number of existing security changes in place, one of
> these
> > > > being
> > > > >> > ACLs
> > > > >> > > on the Kafka broker. Do you plan to gate this feature roll-out
> > on
> > > > >> > detecting
> > > > >> > > that those prerequisite features are enabled?
> > > > >> > >
> > > > >> > > I wasn't planning on it. With the pluggable nature of Kafka
> and
> > > > Kafka
> > > > >> > > Connect authorization, I imagine it'd be pretty difficult to
> > know
> > > if
> > > > >> all
> > > > >> > of
> > > > >> > > the important resources (worker group, internal topics, and
> > Kafka
> > > > >> Connect
> > > > >> > > REST API are a few that come to mind) are actually secured.
> > > > >> Additionally,
> > > > >> > > the presence of other attack vectors shouldn't be
> justification
> > > for
> > > > >> > opening
> > > > >> > > up a new one; it seems particularly dangerous in the event that
> > > > someone
> > > > >> > > accidentally misconfigures their Connect cluster and this
> > endpoint
> > > > >> ends
> > > > >> > up
> > > > >> > > being exposed even though the user believes it to be
> protected.
> > > > >> > >
> > > > >> > > > If not, I think this feature adds no additional security to
> > > > >> unsecured
> > > > >> > > clusters, while still incurring
> > > > >> > > the overhead on signing requests.
> > > > >> > >
> > > > >> > > This seems like a bit of an overstatement. The changes
> > > > >> > > would still add
> > > > >> > > additional security; I think restricting even accidental
> access
> > to
> > > > the
> > > > >> > > internal REST endpoint is a good thing since it's not part of
> > the
> > > > >> public
> > > > >> > > API and it's all but guaranteed that if someone makes a
> request
> > to
> > > > >> that
> > > > >> > > endpoint they're either confused or malicious. Adding another
> > > hurdle
> > > > >> in
> > > > >> > the
> > > > >> > > way for malicious users is also beneficial, even if it's not a
> > > > >> > > comprehensive solution.
> > > > >> > >
> > > > >> > > > Do you know how significant the overhead will be, as a rough
> > > > >> estimate?
> > > > >> > >
> > > > >> > > Not very significant. The overhead would only be incurred when
> > > > >> followers
> > > > >> > > need to relay task configurations to the leader, which
> shouldn't
> > > > >> happen
> > > > >> > > very frequently in any stable connector. If that is happening
> > > > >> frequently,
> > > > >> > > the overhead for starting new tasks or stopping/reconfiguring
> > > > existing
> > > > >> > > tasks would probably be orders of magnitude higher anyways.
> > > > >> > >
> > > > >> > >
> > > > >> > > Thanks again to Magesh, Konstantine, and Greg for your
> feedback.
> > > > I'll
> > > > >> be
> > > > >> > > updating the KIP to reflect changes mentioned here and to
> > include
> > > > >> answers
> > > > >> > > to some of the questions raised; looking forward to the next
> > round
> > > > of
> > > > >> > > comments.
> > > > >> > >
> > > > >> > > Cheers,
> > > > >> > >
> > > > >> > > Chris
> > > > >> > >
> > > > >> > > On Wed, Aug 28, 2019 at 9:02 AM Greg Harris <
> gregh@confluent.io
> > >
> > > > >> wrote:
> > > > >> > >
> > > > >> > >> Hey Chris,
> > > > >> > >>
> > > > >> > >> The KIP makes sense to me, and I think it's a very natural
> way
> > to
> > > > >> solve
> > > > >> > >> the
> > > > >> > >> issue at hand. I had two questions about the automatic
> rollout
> > > > logic.
> > > > >> > >>
> > > > >> > >> Does this roll-out ratchet the protocol version irreversibly?
> > > What
> > > > is
> > > > >> > the
> > > > >> > >> expected behavior when the feature has been enabled, and a
> > worker
> > > > >> joins
> > > > >> > >> the
> > > > >> > >> group while advertising an old protocol? I think it's
> > reasonable
> > > to
> > > > >> > >> prevent
> > > > >> > >> single-worker downgrades, since this is an attack vector. But
> > > what
> > > > >> > happens
> > > > >> > >> if every worker in a cluster goes down, and at least the
> first
> > > > worker
> > > > >> > >> comes
> > > > >> > >> back with an old protocol? This is a potential attack that
> > > requires
> > > > >> > access
> > > > >> > >> to the workers, but could also be used by customers to
> > > > intentionally
> > > > >> > >> downgrade their cluster.
> > > > >> > >>
> > > > >> > >> You mentioned that this security improvement only adds
> security
> > > > when
> > > > >> > there
> > > > >> > >> are a number of existing security changes in place, one of
> > these
> > > > >> being
> > > > >> > >> ACLs
> > > > >> > >> on the Kafka broker. Do you plan to gate this feature
> roll-out
> > on
> > > > >> > >> detecting
> > > > >> > >> that those prerequisite features are enabled? If not, I think
> > > this
> > > > >> > feature
> > > > >> > >> adds no additional security to unsecured clusters, while
> still
> > > > >> incurring
> > > > >> > >> the overhead on signing requests. Do you know how significant
> > the
> > > > >> > overhead
> > > > >> > >> will be, as a rough estimate?
> > > > >> > >>
> > > > >> > >> Looks great!
> > > > >> > >>
> > > > >> > >> Thanks,
> > > > >> > >> Greg
> > > > >> > >>
> > > > >> > >>
> > > > >> > >> On Tue, Aug 27, 2019 at 8:38 PM Konstantine Karantasis <
> > > > >> > >> konstantine@confluent.io> wrote:
> > > > >> > >>
> > > > >> > >> > Thanks for the KIP Chris!
> > > > >> > >> >
> > > > >> > >> > This proposal will fill a gap in Kafka Connect deployments
> > and
> > > > will
> > > > >> > >> > strengthen their production readiness in even more diverse
> > > > >> > >> environments, in
> > > > >> > >> > which current workarounds are less practical.
> > > > >> > >> >
> > > > >> > >> > I've read the discussions regarding why the rebalancing
> > > protocol
> > > > is
> > > > >> > used
> > > > >> > >> > here and your intention to follow the approach which was
> > > recently
> > > > >> used
> > > > >> > >> in
> > > > >> > >> > order to elegantly support upgrades without requiring
> rolling
> > > > >> restarts
> > > > >> > >> > makes sense.
> > > > >> > >> >
> > > > >> > >> > However, I'd like to revisit the proposed extension of the
> > > > connect
> > > > >> > >> protocol
> > > > >> > >> > going a little bit more in detail.
> > > > >> > >> > Indeed, broadcasting the session key to the followers is
> > > > necessary.
> > > > >> > But
> > > > >> > >> > adding it to the configs topic with a new key is compatible
> > > with
> > > > >> > >> previous
> > > > >> > >> > versions (the key will be ignored by workers that don't
> > support
> > > > >> this
> > > > >> > >> > protocol) and works since all workers will need to read
> this
> > > > topic
> > > > >> to
> > > > >> > >> the
> > > > >> > >> > end to remain in the group.
> > > > >> > >> >
> > > > >> > >> > Additionally, to your point about consensus, meaning how
> the
> > > > >> workers
> > > > >> > all
> > > > >> > >> > know during the same generation that the "sessioned"
> version
> > of
> > > > the
> > > > >> > >> > protocol was chosen by the leader, it seems to me that an
> > > > explicit
> > > > >> > >> upgrade
> > > > >> > >> > of the protocol on the wire, on its name and on the
> metadata
> > > that
> > > > >> are
> > > > >> > >> sent
> > > > >> > >> > with the join request by each worker might not be required.
> > > What
> > > > I
> > > > >> > >> believe
> > > > >> > >> > would suffice is merely an increment of the version of the
> > > > >> protocol,
> > > > >> > >> > because all the other information remains the same (for
> > > example,
> > > > >> the
> > > > >> > >> list
> > > > >> > >> > of assigned and revoked tasks, etc). This would allow us
> not
> > to
> > > > >> extend
> > > > >> > >> the
> > > > >> > >> > metadata even more with yet another JoinGroupRequest and
> > could
> > > > >> achieve
> > > > >> > >> what
> > > > >> > >> > you want in terms of an orchestrated upgrade.
> > > > >> > >> >
> > > > >> > >> > Without having exhaustively checked the various code
> paths, I
> > > > feel
> > > > >> > that
> > > > >> > >> > such an approach would be less heavy-handed in terms of
> > > extending
> > > > >> the
> > > > >> > >> > format of the connect protocol and would achieve a certain
> > > > >> separation
> > > > >> > of
> > > > >> > >> > concerns between the information that is required for
> > > rebalancing
> > > > >> and
> > > > >> > is
> > > > >> > >> > included in the protocol metadata and the information that
> is
> > > > >> required
> > > > >> > >> for
> > > > >> > >> > securing the internal Connect REST endpoint. In theory,
> this
> > > > method
> > > > >> > >> could
> > > > >> > >> > be used to even secure the eager version of the protocol,
> but
> > > I'd
> > > > >> > agree
> > > > >> > >> > that this is out of the scope of the current proposal.
> > > > >> > >> >
> > > > >> > >> > All the leader worker would have to check is whether all
> the
> > > > >> > assignments
> > > > >> > >> > that it took were in "sessioned" version (possibly
> > > > >> > >> CONNECT_PROTOCOL_V2=2 if
> > > > >> > >> > you'd choose to go this route).
> > > > >> > >> >
> > > > >> > >> > Overall, this does not differ a lot from your current
> > proposal,
> > > > >> which
> > > > >> > I
> > > > >> > >> > think is already in the right direction.
> > > > >> > >> > Let me know what you think.
> > > > >> > >> >
> > > > >> > >> > Cheers,
> > > > >> > >> > Konstantine
> > > > >> > >> >
> > > > >> > >> >
> > > > >> > >> > On Tue, Aug 27, 2019 at 4:19 PM Magesh Nandakumar <
> > > > >> > mageshn@confluent.io
> > > > >> > >> >
> > > > >> > >> > wrote:
> > > > >> > >> >
> > > > >> > >> > > Hi Chris,
> > > > >> > >> > >
> > > > >> > >> > > Thanks a lot for the KIP. This will certainly be a useful
> > > > >> feature. I
> > > > >> > >> > would
> > > > >> > >> > > have preferred to use the topic approach as well but I
> also
> > > > >> > understand
> > > > >> > >> > your
> > > > >> > >> > > point of view about the operational complexity for
> > upgrades.
> > > If
> > > > >> not
> > > > >> > >> with
> > > > >> > >> > > this KIP, I would certainly want to go that route at some
> > > point
> > > > >> in
> > > > >> > the
> > > > >> > >> > > future.
> > > > >> > >> > >
> > > > >> > >> > > As far as using the rebalance protocol goes, it would be
> > > great
> > > > if
> > > > >> > you
> > > > >> > >> > could
> > > > >> > >> > > elaborate on what exactly would be the rebalance impact
> > when
> > > a
> > > > >> key
> > > > >> > >> > expires.
> > > > >> > >> > > I see that you have called out saying that there should
> be
> > no
> > > > >> > >> significant
> > > > >> > >> > > impact but it will be great to explicitly state as what
> is
> > to
> > > > be
> > > > >> > >> > expected.
> > > > >> > >> > > I would prefer to not have any sorts of rebalancing when
> > this
> > > > >> > happens
> > > > >> > >> > since
> > > > >> > >> > > the connector and task assignments should not change with
> > > this
> > > > >> > event.
> > > > >> > >> It
> > > > >> > >> > > will be useful to explain this a bit more.
> > > > >> > >> > >
> > > > >> > >> > > Thanks,
> > > > >> > >> > > Magesh
> > > > >> > >> > >
> > > > >> > >> > > On Wed, Aug 21, 2019 at 4:40 PM Chris Egerton <
> > > > >> chrise@confluent.io>
> > > > >> > >> > wrote:
> > > > >> > >> > >
> > > > >> > >> > > > Hi all,
> > > > >> > >> > > >
> > > > >> > >> > > > I've made some tweaks to the KIP that I believe are
> > > > >> improvements.
> > > > >> > >> More
> > > > >> > >> > > > detail can be found on the KIP page itself, but as a
> > brief
> > > > >> > summary,
> > > > >> > >> the
> > > > >> > >> > > > three changes are:
> > > > >> > >> > > >
> > > > >> > >> > > > - The removal of the internal.request.verification
> > property
> > > > in
> > > > >> > >> favor of
> > > > >> > >> > > > modifying the default value for the connect.protocol
> > > property
> > > > >> from
> > > > >> > >> > > > "compatible" to "sessioned"
> > > > >> > >> > > > - The renaming of some configurations to use better
> > > > terminology
> > > > >> > >> (mostly
> > > > >> > >> > > > just "request" instead of "key" where appropriate)
> > > > >> > >> > > > - The addition of two new configurations that dictate
> how
> > > > >> session
> > > > >> > >> keys
> > > > >> > >> > > are
> > > > >> > >> > > > to be generated
> > > > >> > >> > > >
> > > > >> > >> > > > Thanks Ryanne for the feedback so far, hope to hear
> from
> > > some
> > > > >> more
> > > > >> > >> of
> > > > >> > >> > you
> > > > >> > >> > > > soon :)
> > > > >> > >> > > >
> > > > >> > >> > > > Cheers,
> > > > >> > >> > > >
> > > > >> > >> > > > Chris
> > > > >> > >> > > >
> > > > >> > >> > > > On Mon, Aug 19, 2019 at 12:02 PM Chris Egerton <
> > > > >> > chrise@confluent.io
> > > > >> > >> >
> > > > >> > >> > > > wrote:
> > > > >> > >> > > >
> > > > >> > >> > > > > Hi Ryanne,
> > > > >> > >> > > > >
> > > > >> > >> > > > > The reasoning for this is provided in the KIP: "There
> > > would
> > > > >> be
> > > > >> > no
> > > > >> > >> > clear
> > > > >> > >> > > > > way to achieve consensus amongst the workers in a
> > cluster
> > > > on
> > > > >> > >> whether
> > > > >> > >> > to
> > > > >> > >> > > > > switch to this new behavior." To elaborate on
> this--it
> > > > would
> > > > >> be
> > > > >> > >> bad
> > > > >> > >> > if
> > > > >> > >> > > a
> > > > >> > >> > > > > follower worker began writing task configurations to
> > the
> > > > >> topic
> > > > >> > >> while
> > > > >> > >> > > the
> > > > >> > >> > > > > leader continued to only listen on the internal REST
> > > > >> endpoint.
> > > > >> > >> It's
> > > > >> > >> > > > > necessary to ensure that every worker in a cluster
> > > supports
> > > > >> this
> > > > >> > >> new
> > > > >> > >> > > > > behavior before switching it on.
> > > > >> > >> > > > >
> > > > >> > >> > > > > To be fair, it is technically possible to achieve
> > > consensus
> > > > >> > >> without
> > > > >> > >> > > using
> > > > >> > >> > > > > the group coordination protocol by adding new
> > > > configurations
> > > > >> and
> > > > >> > >> > using
> > > > >> > >> > > a
> > > > >> > >> > > > > multi-phase rolling upgrade, but the operational
> > > complexity
> > > > >> of
> > > > >> > >> this
> > > > >> > >> > > > > approach would significantly complicate things for
> the
> > > > >> default
> > > > >> > >> case
> > > > >> > >> > of
> > > > >> > >> > > > just
> > > > >> > >> > > > > wanting to bump your Connect cluster to the next
> > version
> > > > >> without
> > > > >> > >> > having
> > > > >> > >> > > > to
> > > > >> > >> > > > > alter your existing worker config files and with
> only a
> > > > >> single
> > > > >> > >> > restart
> > > > >> > >> > > > per
> > > > >> > >> > > > > worker. The proposed approach provides a much simpler
> > > user
> > > > >> > >> experience
> > > > >> > >> > > for
> > > > >> > >> > > > > the most common use case and doesn't impose much
> > > additional
> > > > >> > >> > complexity
> > > > >> > >> > > > for
> > > > >> > >> > > > > other use cases.
> > > > >> > >> > > > >
> > > > >> > >> > > > > I've updated the KIP to expand on points from your
> last
> > > > >> emails;
> > > > >> > >> let
> > > > >> > >> > me
> > > > >> > >> > > > > know what you think.
> > > > >> > >> > > > >
> > > > >> > >> > > > > Cheers,
> > > > >> > >> > > > >
> > > > >> > >> > > > > Chris
> > > > >> > >> > > > >
> > > > >> > >> > > > > On Thu, Aug 15, 2019 at 4:53 PM Ryanne Dolan <
> > > > >> > >> ryannedolan@gmail.com>
> > > > >> > >> > > > > wrote:
> > > > >> > >> > > > >
> > > > >> > >> > > > >> Thanks Chris, that makes sense.
> > > > >> > >> > > > >>
> > > > >> > >> > > > >> I know you have already considered this, but I'm not
> > > > >> convinced
> > > > >> > we
> > > > >> > >> > > should
> > > > >> > >> > > > >> rule out using Kafka topics for this purpose. That
> > would
> > > > >> enable
> > > > >> > >> the
> > > > >> > >> > > same
> > > > >> > >> > > > >> level of security without introducing any new
> > > > >> authentication or
> > > > >> > >> > > > >> authorization mechanisms (your keys). And as you
> say,
> > > > you'd
> > > > >> > need
> > > > >> > >> to
> > > > >> > >> > > lock
> > > > >> > >> > > > >> down Connect's topics and groups anyway.
> > > > >> > >> > > > >>
> > > > >> > >> > > > >> Can you explain what you mean when you say using
> Kafka
> > > > >> topics
> > > > >> > >> would
> > > > >> > >> > > > >> require
> > > > >> > >> > > > >> "reworking the group coordination protocol"? I don't
> > see
> > > > how
> > > > >> > >> these
> > > > >> > >> > are
> > > > >> > >> > > > >> related. Why would it matter if the workers sent
> Kafka
> > > > >> messages
> > > > >> > >> vs
> > > > >> > >> > > POST
> > > > >> > >> > > > >> requests among themselves?
> > > > >> > >> > > > >>
> > > > >> > >> > > > >> Ryanne
> > > > >> > >> > > > >>
> > > > >> > >> > > > >> On Thu, Aug 15, 2019 at 3:57 PM Chris Egerton <
> > > > >> > >> chrise@confluent.io>
> > > > >> > >> > > > >> wrote:
> > > > >> > >> > > > >> >
> > > > >> > >> > > > >> > Hi Ryanne,
> > > > >> > >> > > > >> >
> > > > >> > >> > > > >> > Yes, if the Connect group is left unsecured then that is a
> > > > >> > >> > > > >> > potential vulnerability. However, in a truly secure Connect
> > > > >> > >> > > > >> > cluster, the group would need to be secured anyway to prevent
> > > > >> > >> > > > >> > attackers from joining the group with the intent to either
> > > > >> > >> > > > >> > snoop on connector/task configurations or bring the cluster
> > > > >> > >> > > > >> > to a halt by spamming the group with membership requests and
> > > > >> > >> > > > >> > then not running the assigned connectors/tasks. Additionally,
> > > > >> > >> > > > >> > for a Connect cluster to be secure, access to internal topics
> > > > >> > >> > > > >> > (for configs, offsets, and statuses) would also need to be
> > > > >> > >> > > > >> > restricted so that attackers could not, e.g., write arbitrary
> > > > >> > >> > > > >> > connector/task configurations to the configs topic. This is
> > > > >> > >> > > > >> > all currently possible in Kafka with the use of ACLs.
> > > > >> > >> > > > >> >
> > > > >> > >> > > > >> > I think the bottom line here is that there are a number of
> > > > >> > >> > > > >> > steps that need to be taken to effectively lock down a
> > > > >> > >> > > > >> > Connect cluster; the point of this KIP is to close a loophole
> > > > >> > >> > > > >> > that exists even after all of those steps are taken, not to
> > > > >> > >> > > > >> > completely secure this one vulnerability even when no other
> > > > >> > >> > > > >> > security measures are taken.
> > > > >> > >> > > > >> >
> > > > >> > >> > > > >> > Cheers,
> > > > >> > >> > > > >> >
> > > > >> > >> > > > >> > Chris
> > > > >> > >> > > > >> >
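The loophole discussed above can be sketched in code: the leader hands out a session key over the already-secured group coordination protocol, and workers use it to sign internal REST requests (such as the task-config POST) so the leader can reject unsigned or tampered requests. This is an illustrative sketch only; the KIP's actual key sizes, algorithm names, and request/header formats may differ.

```python
import hmac
import secrets

# Hypothetical sketch of the request-signing scheme; actual config names,
# algorithms, and wire formats in the KIP/implementation may differ.
SIGNATURE_ALGORITHM = "sha256"  # stand-in for a JCA name like HmacSHA256

def generate_session_key(num_bytes: int = 32) -> bytes:
    # The leader would generate this key and distribute it to workers over
    # the (already secured) group coordination protocol during a rebalance.
    return secrets.token_bytes(num_bytes)

def sign_request(session_key: bytes, body: bytes) -> str:
    # A follower signs the body of an internal request such as
    # POST /connectors/<name>/tasks before forwarding it to the leader.
    return hmac.new(session_key, body, SIGNATURE_ALGORITHM).hexdigest()

def verify_request(session_key: bytes, body: bytes, signature: str) -> bool:
    # The leader recomputes the MAC and compares in constant time.
    expected = hmac.new(session_key, body, SIGNATURE_ALGORITHM).hexdigest()
    return hmac.compare_digest(expected, signature)

key = generate_session_key()
body = b'[{"task": 0, "config": {"connector.class": "..."}}]'
sig = sign_request(key, body)
print(verify_request(key, body, sig))              # True: signed by a key holder
print(verify_request(key, b"tampered body", sig))  # False: body was altered
```

Note that the scheme deliberately protects only against parties outside the group; it assumes the group itself is locked down by the ACL-based measures described above.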
> > > > >> > >> > > > >> > On Wed, Aug 14, 2019 at 10:56 PM Ryanne Dolan <ryannedolan@gmail.com> wrote:
> > > > >> > >> > > > >> > >
> > > > >> > >> > > > >> > > Chris, I don't understand how the rebalance protocol can
> > > > >> > >> > > > >> > > be used to give out session tokens in a secure way. It
> > > > >> > >> > > > >> > > seems that any attacker could just join the group and sign
> > > > >> > >> > > > >> > > requests with the provided token. Am I missing something?
> > > > >> > >> > > > >> > >
> > > > >> > >> > > > >> > > Ryanne
> > > > >> > >> > > > >> > >
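Ryanne's concern can be illustrated under the same assumptions (names and algorithm are hypothetical): a MAC over the request proves possession of the session key, not the identity of the signer, so the scheme is only as strong as the restrictions on who may join the group and receive the key.

```python
import hmac
import secrets

# Hypothetical sketch: the MAC checks only key possession. Anyone the
# group protocol hands the session key to can produce valid signatures.
session_key = secrets.token_bytes(32)

def valid(body: bytes, signature: str) -> bool:
    expected = hmac.new(session_key, body, "sha256").hexdigest()
    return hmac.compare_digest(expected, signature)

# An "attacker" who joined an unsecured group and was handed the key:
attacker_key = session_key
forged = b'[{"task": 0, "config": {"malicious": "true"}}]'
forged_sig = hmac.new(attacker_key, forged, "sha256").hexdigest()
print(valid(forged, forged_sig))    # True: the key is the only credential

# An outsider without the key (group membership restricted, e.g. by ACLs):
outsider_sig = hmac.new(secrets.token_bytes(32), forged, "sha256").hexdigest()
print(valid(forged, outsider_sig))  # False: forgery fails without the key
```

This is exactly why the preceding discussion stresses securing the group itself: the signing scheme closes the loophole for outsiders, while ACLs keep attackers from becoming insiders.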
> > > > >> > >> > > > >> > > On Wed, Aug 14, 2019, 2:31 PM Chris Egerton <chrise@confluent.io> wrote:
> > > > >> > >> > > > >> > > >
> > > > >> > >> > > > >> > > > The KIP page can be found at
> > > > >> > >> > > > >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-507%3A+Securing+Internal+Connect+REST+Endpoints,
> > > > >> > >> > > > >> > > > by the way. Apologies for neglecting to include it in my
> > > > >> > >> > > > >> > > > initial email!
> > > > >> > >> > > > >> > > >
> > > > >> > >> > > > >> > > > On Wed, Aug 14, 2019 at 12:29 PM Chris Egerton <chrise@confluent.io> wrote:
> > > > >> > >> > > > >> > > >
> > > > >> > >> > > > >> > > > > Hi all,
> > > > >> > >> > > > >> > > > >
> > > > >> > >> > > > >> > > > > I'd like to start discussion on a KIP to secure the
> > > > >> > >> > > > >> > > > > internal "POST /connectors/<name>/tasks" endpoint for
> > > > >> > >> > > > >> > > > > the Connect framework. The proposed changes address a
> > > > >> > >> > > > >> > > > > vulnerability in the framework in its current state
> > > > >> > >> > > > >> > > > > that allows malicious users to write arbitrary task
> > > > >> > >> > > > >> > > > > configurations for connectors; it is vital that this
> > > > >> > >> > > > >> > > > > issue be addressed in order for any Connect cluster to
> > > > >> > >> > > > >> > > > > be secure.
> > > > >> > >> > > > >> > > > >
> > > > >> > >> > > > >> > > > > Looking forward to your thoughts,
> > > > >> > >> > > > >> > > > >
> > > > >> > >> > > > >> > > > > Chris
