kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jun Rao <...@confluent.io>
Subject Re: [VOTE] KIP-455: Create an Administrative API for Replica Reassignment
Date Tue, 30 Jul 2019 21:18:17 GMT
Hi, Colin,

Thanks for the KIP. Sorry for the late reply. LGTM overall. A few detailed
comments below.

10. The KIP adds two new fields (AddingReplicas and RemovingReplicas) to
LeaderAndIsr request. Could you explain how these 2 fields will be used?
Should we include those two fields in UpdateMetadata and potentially
Metadata requests too?

11. "If a new reassignment is issued during an on-going one, we cancel the
current one by emptying out both AR and RR, constructing them from (the
updated from the last-reassignment) R and TR, and starting anew." In this
case, it seems that the controller needs to issue a StopReplica request to
remove those unneeded replicas.

12. "Essentially, once a cancellation is called we subtract AR from R,
empty out both AR and RR, and send LeaderAndIsr requests to cancel the
replica movements that have not yet completed." Similar to the above, it
seems the controller needs to issue a StopReplica request to remove those
unneeded replicas.

13. Since we changed the format of the topics/[topic] zNode, should we bump
up the version number in the json value?

Jun

On Mon, Jul 22, 2019 at 8:38 AM Colin McCabe <cmccabe@apache.org> wrote:

> Hi all,
>
> With three non-binding +1 votes from Viktor Somogyi-Vass, Robert Barrett,
> and George Li, and 3 binding +1 votes from Gwen Shapira, Jason Gustafson,
> and myself, the vote passes.  Thanks, everyone!
>
> best,
> Colin
>
> On Fri, Jul 19, 2019, at 08:55, Robert Barrett wrote:
> > +1 (non-binding). Thanks for the KIP!
> >
> > On Thu, Jul 18, 2019 at 5:59 PM George Li <sql_consulting@yahoo.com
> .invalid>
> > wrote:
> >
> > >  +1 (non-binding)
> > >
> > >
> > >
> > > Thanks for addressing the comments.
> > > George
> > >
> > >     On Thursday, July 18, 2019, 05:03:58 PM PDT, Gwen Shapira <
> > > gwen@confluent.io> wrote:
> > >
> > >  Renewing my +1, thank you Colin and Stan for working through all the
> > > questions, edge cases, requests and alternatives. We ended up with a
> > > great protocol.
> > >
> > > On Thu, Jul 18, 2019 at 4:54 PM Jason Gustafson <jason@confluent.io>
> > > wrote:
> > > >
> > > > +1 Thanks for the KIP. Really looking forward to this!
> > > >
> > > > -Jason
> > > >
> > > > On Wed, Jul 17, 2019 at 1:41 PM Colin McCabe <cmccabe@apache.org>
> wrote:
> > > >
> > > > > Thanks, Stanislav.  Let's restart the vote to reflect the fact that
> > > we've
> > > > > made significant changes.  The new vote will go for 3 days as
> usual.
> > > > >
> > > > > I'll start with my +1 (binding).
> > > > >
> > > > > best,
> > > > > Colin
> > > > >
> > > > >
> > > > > On Wed, Jul 17, 2019, at 08:56, Stanislav Kozlovski wrote:
> > > > > > Hey everybody,
> > > > > >
> > > > > > We have further iterated on the KIP in the accompanying
> discussion
> > > thread
> > > > > > and I'd like to propose we resume the vote.
> > > > > >
> > > > > > Some notable changes:
> > > > > > - we will store reassignment information in the
> > > `/brokers/topics/[topic]`
> > > > > > - we will internally use two collections to represent a
> reassignment
> > > -
> > > > > > "addingReplicas" and "removingReplicas". LeaderAndIsr has been
> > > updated
> > > > > > accordingly
> > > > > > - the Alter API will still use the "targetReplicas" collection,
> but
> > > the
> > > > > > List API will now return three separate collections - the full
> > > replica
> > > > > set,
> > > > > > the replicas we are adding as part of this reassignment
> > > > > ("addingReplicas")
> > > > > > and the replicas we are removing ("removingReplicas")
> > > > > > - cancellation of a reassignment now means a proper rollback
of
> the
> > > > > > assignment to its original state prior to the API call
> > > > > >
> > > > > > As always, you can re-read the KIP here
> > > > > >
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment
> > > > > >
> > > > > > Best,
> > > > > > Stanislav
> > > > > >
> > > > > > On Wed, May 22, 2019 at 6:12 PM Colin McCabe <cmccabe@apache.org
> >
> > > wrote:
> > > > > >
> > > > > > > Hi George,
> > > > > > >
> > > > > > > Thanks for taking a look.  I am working on getting a PR
done
> as a
> > > > > > > proof-of-concept.  I'll post it soon.  Then we'll finish
up the
> > > vote.
> > > > > > >
> > > > > > > best,
> > > > > > > Colin
> > > > > > >
> > > > > > > On Tue, May 21, 2019, at 17:33, George Li wrote:
> > > > > > > >  Hi Colin,
> > > > > > > >
> > > > > > > >  Great! Looking forward to these features.    +1
> (non-binding)
> > > > > > > >
> > > > > > > > What is the estimated timeline to have this implemented?
 If
> any
> > > help
> > > > > > > > is needed in the implementation of cancelling
> reassignments,  I
> > > can
> > > > > > > > help if there is spare cycle.
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > George
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >    On Thursday, May 16, 2019, 9:48:56 AM PDT, Colin
McCabe
> > > > > > > > <cmccabe@apache.org> wrote:
> > > > > > > >
> > > > > > > >  Hi George,
> > > > > > > >
> > > > > > > > Yes, KIP-455 allows the reassignment of individual
> partitions to
> > > be
> > > > > > > > cancelled.  I think it's very important for these
operations
> to
> > > be at
> > > > > > > > the partition level.
> > > > > > > >
> > > > > > > > best,
> > > > > > > > Colin
> > > > > > > >
> > > > > > > > On Tue, May 14, 2019, at 16:34, George Li wrote:
> > > > > > > > >  Hi Colin,
> > > > > > > > >
> > > > > > > > > Thanks for the updated KIP.  It has very good
improvements
> of
> > > Kafka
> > > > > > > > > reassignment operations.
> > > > > > > > >
> > > > > > > > > One question, looks like the KIP includes the
Cancellation
> of
> > > > > > > > > individual pending reassignments as well when
the
> > > > > > > > > AlterPartitionReasisgnmentRequest has empty replicas
for
> the
> > > > > > > > > topic/partition. Will you also be implementing
the the
> > > partition
> > > > > > > > > cancellation/rollback in the PR ?    If yes,
 it will make
> > > KIP-236
> > > > > (it
> > > > > > > > > has PR already) trivial, since the cancel all
pending
> > > > > reassignments,
> > > > > > > > > one just needs to do a ListPartitionRessignmentRequest,
> then
> > > submit
> > > > > > > > > empty replicas for all those topic/partitions
in
> > > > > > > > > one AlterPartitionReasisgnmentRequest.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > George
> > > > > > > > >
> > > > > > > > >    On Friday, May 10, 2019, 8:44:31 PM PDT, Colin
McCabe
> > > > > > > > > <cmccabe@apache.org> wrote:
> > > > > > > > >
> > > > > > > > >  On Fri, May 10, 2019, at 17:34, Colin McCabe
wrote:
> > > > > > > > > > On Fri, May 10, 2019, at 16:43, Jason Gustafson
wrote:
> > > > > > > > > > > Hi Colin,
> > > > > > > > > > >
> > > > > > > > > > > I think storing reassignment state
at the partition
> level
> > > is
> > > > > the
> > > > > > > right move
> > > > > > > > > > > and I also agree that replicas should
understand that
> > > there is
> > > > > a
> > > > > > > > > > > reassignment in progress. This makes
KIP-352 a trivial
> > > > > follow-up
> > > > > > > for
> > > > > > > > > > > example. The only doubt I have is whether
the leader
> and
> > > isr
> > > > > znode
> > > > > > > is the
> > > > > > > > > > > right place to store the target reassignment.
It is a
> bit
> > > odd
> > > > > to
> > > > > > > keep the
> > > > > > > > > > > target assignment in a separate place
from the current
> > > > > assignment,
> > > > > > > right? I
> > > > > > > > > > > assume the thinking is probably that
although the
> current
> > > > > > > assignment should
> > > > > > > > > > > probably be in the leader and isr znode
as well, it is
> > > hard to
> > > > > > > move the
> > > > > > > > > > > state in a compatible way. Is that
right? But if we
> have no
> > > > > plan
> > > > > > > to remove
> > > > > > > > > > > the assignment znode, do you see a
downside to storing
> the
> > > > > target
> > > > > > > > > > > assignment there as well?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi Jason,
> > > > > > > > > >
> > > > > > > > > > That's a good point -- it's probably better
to keep the
> > > target
> > > > > > > > > > assignment in the same znode as the current
assignment,
> for
> > > > > > > > > > consistency.  I'll change the KIP.
> > > > > > > > >
> > > > > > > > > Hi Jason,
> > > > > > > > >
> > > > > > > > > Thanks again for the review.
> > > > > > > > >
> > > > > > > > > I took another look at this, and I think we should
stick
> with
> > > the
> > > > > > > > > initial proposal of putting the reassignment
state into
> > > > > > > > > /brokers/topics/[topic]/partitions/[partitionId]/state.
> The
> > > > > reason is
> > > > > > > > > because we'll want to bump the leader epoch for
the
> partition
> > > when
> > > > > > > > > changing the reassignment state, and the leader
epoch
> resides
> > > in
> > > > > that
> > > > > > > > > znode anyway.  I agree there is some inconsistency
here,
> but
> > > so be
> > > > > it:
> > > > > > > > > if we were to greenfield these zookeeper data
structures,
> we
> > > might
> > > > > do
> > > > > > > > > it differently, but the proposed scheme will
work fine and
> be
> > > > > > > > > extensible for the future.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > A few additional questions:
> > > > > > > > > > >
> > > > > > > > > > > 1. Should `alterPartitionReassignments`
be
> > > > > > > `alterPartitionAssignments`?
> > > > > > > > > > > It's the current assignment we're altering,
right?
> > > > > > > > > >
> > > > > > > > > > That's fair.  AlterPartitionAssigments reads
a little
> > > better, and
> > > > > > > I'll
> > > > > > > > > > change it to that.
> > > > > > > > >
> > > > > > > > > +1.  I've changed the RPC and API name in the
wiki.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 2. Does this change affect the Metadata
API? In other
> > > words,
> > > > > are
> > > > > > > clients
> > > > > > > > > > > aware of reassignments? If so, then
we probably need a
> > > change
> > > > > to
> > > > > > > > > > > UpdateMetadata as well. The only alternative
I can
> think of
> > > > > would
> > > > > > > be to
> > > > > > > > > > > represent the replica set in the Metadata
request as
> the
> > > union
> > > > > of
> > > > > > > the
> > > > > > > > > > > current and target replicas, but I
can't think of any
> > > benefit
> > > > > to
> > > > > > > hiding
> > > > > > > > > > > reassignments. Note that if we did
this, we probably
> > > wouldn't
> > > > > need
> > > > > > > a
> > > > > > > > > > > separate API to list reassignments.
> > > > > > > > > >
> > > > > > > > > > I thought about this a bit... and I think
on balance,
> you're
> > > > > right.
> > > > > > > We
> > > > > > > > > > should keep this information together with
the replica
> > > nodes, isr
> > > > > > > > > > nodes, and offline replicas, and that information
is
> > > available in
> > > > > > > the
> > > > > > > > > > MetadataResponse.
> > > > > > > > > >  However, I do think in order to do this,
we'll need a
> flag
> > > in
> > > > > the
> > > > > > > > > > MetadataRequest that specifiies "only show
me reassigning
> > > > > > > partitions".
> > > > > > > > > > I'll add this.
> > > > > > > > >
> > > > > > > > > I revisited this, and I think we should stick
with the
> original
> > > > > > > > > proposal of having a separate ListPartitionReassignments
> API.
> > > > > There
> > > > > > > > > really is no use case where the Producer or Consumer
needs
> to
> > > know
> > > > > > > > > about a reassignment.  They should just be notified
when
> the
> > > set of
> > > > > > > > > partitions changes, which doesn't require changes
to
> > > > > > > > > MetadataRequest/Response.  The Admin client only
cares if
> > > someone
> > > > > is
> > > > > > > > > managing the reassignment.  So adding this state
to the
> > > > > > > > > MetadataResponse adds overhead for no real benefit.
 In the
> > > common
> > > > > > > case
> > > > > > > > > where there is no ongoing reassignment, it would
be 4
> bytes per
> > > > > > > > > partition of extra overhead in the MetadataResponse.
> > > > > > > > >
> > > > > > > > > In general, I think we have a problem of oversharing
in the
> > > > > > > > > MetadataRequest/Response.  As we 10x or 100x
the number of
> > > > > partitions
> > > > > > > > > we support, we'll need to get stricter about
giving clients
> > > only
> > > > > the
> > > > > > > > > information they actually need, about the partitions
they
> > > actually
> > > > > > > care
> > > > > > > > > about.  Reassignment state clearly falls in the
category of
> > > state
> > > > > that
> > > > > > > > > isn't needed by clients (except very specialized
> rebalancing
> > > > > programs).
> > > > > > > > >
> > > > > > > > > Another important consideration here is that
someone
> managing
> > > an
> > > > > > > > > ongoing reassignment wants the most up-to-date
information,
> > > which
> > > > > is
> > > > > > > to
> > > > > > > > > be found on the controller.  Therefore adding
this state to
> > > > > listTopics
> > > > > > > > > or describeTopics, which could contact any node
in the
> > > cluster, is
> > > > > > > > > sub-optimal.
> > > > > > > > >
> > > > > > > > > Finally, adding this to listTopics or describeTopics
feels
> > > like a
> > > > > > > warty
> > > > > > > > > API.  It's an extra boolean which interacts with
other
> extra
> > > > > booleans
> > > > > > > > > like "show internal", etc. in weird ways.  I
think a
> separate
> > > API
> > > > > is
> > > > > > > > > cleaner.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 3. As replicas come into sync, they
will join the ISR.
> > > Will we
> > > > > > > await all
> > > > > > > > > > > target replicas joining the ISR before
taking the
> replica
> > > out
> > > > > of
> > > > > > > the target
> > > > > > > > > > > replicas set? Also, I assume that target
replicas can
> > > still be
> > > > > > > elected as
> > > > > > > > > > > leader?
> > > > > > > > > >
> > > > > > > > > > We'll take a replica out of the target replicas
set as
> soon
> > > as
> > > > > that
> > > > > > > > > > replica is in the ISR.  Let me clarify this
in the KIP.
> > > > > > > > > >
> > > > > > > > > > > 4. Probably useful to mention permissions
for the new
> APIs.
> > > > > > > > > >
> > > > > > > > > > Good point.  I think alterPartitionAssignments
should
> require
> > > > > ALTER
> > > > > > > on
> > > > > > > > > > CLUSTER.  MetadataRequest permissions will
be unchanged.
> > > > > > > > >
> > > > > > > > > I added permission information.
> > > > > > > > >
> > > > > > > > > best,
> > > > > > > > > Colin
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > best,
> > > > > > > > > > Colin
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Jason
> > > > > > > > > > >
> > > > > > > > > > > On Fri, May 10, 2019 at 9:30 AM Gwen
Shapira <
> > > > > gwen@confluent.io>
> > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > +1 (binding)
> > > > > > > > > > > > Looks great, and will be awesome
to have this new
> > > capability.
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, May 8, 2019 at 10:23 PM
Colin McCabe <
> > > > > cmccabe@apache.org>
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'd like to start the vote
for KIP-455: Create an
> > > > > > > Administrative API for
> > > > > > > > > > > > > Replica Reassignment.  I
think this KIP is
> important
> > > since
> > > > > it
> > > > > > > will unlock
> > > > > > > > > > > > > many follow-on improvements
to Kafka reassignment
> (see
> > > the
> > > > > > > "Future work"
> > > > > > > > > > > > > section, plus a lot of the
other discussions we've
> had
> > > > > > > recently about
> > > > > > > > > > > > > reassignment).  It also furthers
the important
> KIP-4
> > > goal
> > > > > of
> > > > > > > removing
> > > > > > > > > > > > > direct access to ZK.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I made a few changes based
on the discussion in the
> > > > > [DISCUSS]
> > > > > > > thread.  As
> > > > > > > > > > > > > Robert suggested, I removed
the need to explicitly
> > > cancel a
> > > > > > > reassignment
> > > > > > > > > > > > > for a partition before setting
up a different
> > > reassignment
> > > > > for
> > > > > > > that
> > > > > > > > > > > > > specific partition.  I also
simplified the API a
> bit by
> > > > > adding
> > > > > > > a
> > > > > > > > > > > > > PartitionReassignment class
which is used by both
> the
> > > alter
> > > > > > > and list
> > > > > > > > > > > > APIs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I modified the proposal so
that we now deprecate
> the
> > > old
> > > > > > > znode-based API
> > > > > > > > > > > > > rather than removing it completely.
 That should
> give
> > > > > external
> > > > > > > > > > > > rebalancing
> > > > > > > > > > > > > tools some time to transition
to the new API.
> > > > > > > > > > > > >
> > > > > > > > > > > > > To clarify a question Viktor
asked, I added a note
> > > that the
> > > > > > > > > > > > > kafka-reassign-partitions.sh
will now use a
> > > > > --bootstrap-server
> > > > > > > argument
> > > > > > > > > > > > to
> > > > > > > > > > > > > contact the admin APIs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > thanks,
> > > > > > > > > > > > > Colin
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > *Gwen Shapira*
> > > > > > > > > > > > Product Manager | Confluent
> > > > > > > > > > > > 650.450.2760 | @gwenshap
> > > > > > > > > > > > Follow us: Twitter <https://twitter.com/ConfluentInc>
> |
> > > blog
> > > > > > > > > > > > <http://www.confluent.io/blog>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> > >
> > >
> > > --
> > > Gwen Shapira
> > > Product Manager | Confluent
> > > 650.450.2760 | @gwenshap
> > > Follow us: Twitter | blog
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message