kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stanislav Kozlovski <stanis...@confluent.io>
Subject Re: [DISCUSS] KIP-455 Create an Admin API for Replica Reassignments
Date Fri, 12 Jul 2019 10:05:11 GMT
Hey Jan,

We will support multiple Admin APIs that can fetch on-going assignments at
different granularities, regardless of which (JVM) process scheduled it.
If a process doesn't know which partitions/topics are being reassigned, it
can ask for all on-going reassignments in the Kafka cluster.
You will only be able to get the current state of reassignments. We will
not support fetching the progress of any particular reassignment call from
the server-side. If the client is interested in a particular reassignment's
progress, it should know what that reassignment consisted of and query
those partitions only.

On Fri, Jul 12, 2019 at 10:04 AM Jan Filipiak <Jan.Filipiak@trivago.com>
wrote:

> Great KIP,
>
> pure java cruise-control would be a nice thing to have <3
>
> I just want to ask what the opinions are on a way to get the Futures of
> the assignment back. Say accross JVms.
>
> Best Jan
>
> On 02.07.2019 19:47, Stanislav Kozlovski wrote:
> > Hey there, I need to start a new thread on KIP-455. I think there might
> be
> > an issue with the mailing server. For some reason, my replies to the
> > previous discussion thread could not be seen by others. After numerous
> > attempts, Colin suggested I start a new thread.
> >
> > Original Discussion Thread:
> >
> https://sematext.com/opensee/m/Kafka/uyzND1Yl7Er128CQu1?subj=+DISCUSS+KIP+455+Create+an+Administrative+API+for+Replica+Reassignment
> > KIP:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment
> > Last Reply of Previous Thread:
> >
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201906.mbox/%3C679a4c5b-3da6-4556-bb89-e680d8cbb705%40www.fastmail.com%3E
> >
> > The following is my reply:
> > ----
> > Hi again,
> >
> > This has been a great discussion on a tricky KIP. I appreciate
> everybody's
> > involvement in improving this crucial API.
> > That being said, I wanted to apologize for my first comment, it was a bit
> > rushed and not thought out.
> >
> > I've got a few questions now that I dove into this better:
> >
> > 1. Does it make sense to have an easy way to cancel all ongoing
> > reassignments? To cancel all ongoing reassignments, users had the crude
> > option of deleting the znode, bouncing the controller and running the
> > rollback JSON assignment that kafka-reassign-partitions.sh gave them
> > (KAFKA-6304).
> > Now that we support multiple reassignment requests, users may add execute
> > them incrementally. Suppose something goes horribly wrong and they want
> to
> > revert as quickly as possible - they would need to run the tool with
> > multiple rollback JSONs.  I think that it would be useful to have an easy
> > way to stop all ongoing reassignments for emergency situations.
> >
> > ---------
> >
> > 2. Our kafka-reassign-partitions.sh tool doesn't seem to currently let
> you
> > figure out the ongoing assignments - I guess we expect people to use
> > kafka-topics.sh for that. I am not sure how well that would continue to
> > work now that we update the replica set only after the new replica joins
> > the ISR.
> > Do you think it makes sense to add an option for listing the current
> > reassignments to the reassign tool as part of this KIP?
> >
> > We might want to think whether we want to show the TargetReplicas
> > information in the kafka-topics command for completeness as well. That
> > might involve the need to update the DescribeTopicsResponse. Personally I
> > can't see a downside but I haven't given it too much thought. I fully
> agree
> > that we don't want to add the target replicas to the full replica set and
> > nothing useful comes out of telling users they have a replica that might
> > not have copied a single byte. Yet, telling them that we have the
> intention
> > of copying bytes sounds useful so maybe having a separate column in
> > kafka-topics.sh would provide better clarity?
> >
> > ---------
> >
> > 3. What happens if we do another reassignment to a partition while one is
> > in progress? Do we overwrite the TargetReplicas?
> > In the example sequence you gave:
> > R: [1, 2, 3, 4, 5, 6], I: [1, 2, 3, 4, 5, 6], T: [4, 5, 6]
> > What would the behavior be if a new reassign request came with
> > TargetReplicas of [7, 8, 9] for that partition?
> >
> > To avoid complexity and potential race conditions, would it make sense to
> > reject a reassign request once one is in progress for the specific
> > partition, essentially forcing the user to cancel it first?
> > Forcing the user to cancel has the benefit of being explicit and guarding
> > against human mistakes. The downside I can think of is that in some
> > scenarios it might be inefficient, e.g
> > R: [1, 2, 3, 4, 5, 6], I: [1, 2, 3, 4, 5, 6], T: [4, 5, 6]
> > Cancel request sent out. Followed by a new reassign request with
> > TargetReplicas of [5, 6, 7] (note that 5 and 6 already fully copied the
> > partition). Becomes a bit of a race condition of whether we deleted the
> > partitions in between requests or not - I assume in practice this won't
> be
> > an issue. I still feel like I prefer the explicit cancellation step
> >
> > ---------
> >
> > 4. My biggest concern - I want to better touch on the interaction between
> > the new API and the current admin/reassign_partitions znode, the
> > compatibility and our strategy there.
> > The KIP says:
> >
> >> For compatibility purposes, we will continue to allow assignments to be
> >> submitted through the /admin/reassign_partitions node. Just as with the
> >> current code, this will only be possible if there are no current
> >> assignments. In other words, the znode has two states: empty and waiting
> >> for a write, and non-empty because there are assignments in progress.
> Once
> >> the znode is non-empty, further writes to it will be ignored.
> >
> > Given the current proposal, I can think of 4 scenarios I want to get a
> > better understanding of:
> >
> > *(i, ii, iii, iiii talk about the reassignment of the same one partition
> > only - partitionA)*
> >
> > i. znode is empty, new reassignment triggered via API, znode is updated
> > When the new reassignment is triggered via the API, do we create the
> znode
> > or do we allow a separate tool to trigger another reassignment through
> it?
> >
> > ii. (assuming we allow creating the znode as with scenario "i"): znode is
> > empty, new reassignment triggered via API, znode is updated, znode is
> > DELETED
> > My understand is that deleting the znode does not do anything until the
> > Controller is bounced - is that correct?
> > If so, this means that nothing will happen. If the Controller is bounced,
> > the reassignment state will still be live in the [partitionId]/state
> znode
> >
> > iii. znode is updated, new reassignment triggered via API
> > We override the reassignment for partitionA. The reassign_partitions
> znode
> > is showing stale data, correct?
> >
> > iiii. znode is updated, new reassignment triggered via API, controller
> > failover
> > What does the controller believe - the [partitionId]/state znode or the
> > /reassign_partitions ? I would assume the [partitionId]/state znode since
> > in this case we want the reassignment API call to be the correct one. I
> > think that opens up the possibility of missing a freshly-set
> > /reassign_partitions though (e.g if it was empty and was set right during
> > controller failover)
> >
> > iiiii. znode is updated to move partitionA, new reassignment triggered
> via
> > API for partitionB, partitionA move finishes
> > At this point, do we delete the znode or do we wait until the partitionB
> > move finishes as well?
> >
> > From the discussion here:
> >
> >> There's no guarantee that what is in the znode reflects the current
> >> reassignments that are going on.  The only thing you can know is that if
> >> the znode exists, there is at least one reassignment going on.
> >
> > This is changing the expected behavior of a tool that obeys Kafka's
> current
> > behavior though. It is true that updating the znode while a reassignment
> is
> > in progress has no effect but make ZK misleading but tools might have
> grown
> > to follow that rule and only update the znode once it is empty. I think
> we
> > might want to be more explicit when making such changes - I had seen
> > discontentment in the community from the fact that we had changed the
> znode
> > updating behavior in a MINOR pull request.
> >
> > I feel it is complex to support both APIs and make sure we don't have
> > unhandled edge cases. I liked Bob's suggestion on potentially allowing
> only
> > one via a feature flag:
> >
> >> Could we temporarily support
> >> both, with a config enabling the new behavior to prevent users from
> trying
> >> to use both mechanisms (if the config is true, the old znode is
> ignored; if
> >> the config is false, the Admin Client API returns an error indicating
> that
> >> it is not enabled)?
> >
> > Perhaps it makes sense to discuss that possibility a bit more?
> >
> > ---------
> >
> > 5. ListPartitionReassignments filtering
> >
> > I guess the thought process here is that most reassignment tools want to
> >> know about all the reassignments that are going on.  If you don't know
> all
> >> the pending reassignments, then it's hard to say whether adding a new
> one
> >> is a good idea, or cancelling an existing one.  So I guess I can't
> think of
> >> a case where a reassignment tool would want a partial set rather than
> the
> >> full one.
> >
> >
> > I agree with Jason about the UIs having "drill into" options. I believe
> we
> > should support some sort of ongoing reassignment filtering at the topic
> > level (that's the administrative concept people care about).
> > An example of a tool that might leverage it is our own
> > kafka-reassign-partitions.sh. You can ask that tool to generate a
> > reassignment for you from a given list of topics. It currently uses
> > `KafkaZkClient#getReplicaAssignmentForTopics()` to get the current
> > assignment for the given topics. It would be better if it could use the
> new
> > ListPartitionsReassignments API to both figure out the current replica
> > assignments and whether or not those topics are being reassigned (it
> could
> > log a warning that a reassignment is in progress for those topics).
> >
> > ---------
> >
> > and a small nit: We also need to update
> > the ListPartitionReassignmentsResponse with the decided
> > current/targetReplicas naming
> >
> > Thanks,
> > Stanislav
> >
>


-- 
Best,
Stanislav

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message