From Jan Filipiak <Jan.Filip...@trivago.com>
Subject Re: [DISCUSS] KIP-455 Create an Admin API for Replica Reassignments
Date Fri, 12 Jul 2019 09:04:01 GMT
Great KIP,

pure java cruise-control would be a nice thing to have <3

I just want to ask what the opinions are on a way to get the Futures of
the assignment back. Say, across JVMs.

Best Jan

On 02.07.2019 19:47, Stanislav Kozlovski wrote:
> Hey there, I need to start a new thread on KIP-455. I think there might be
> an issue with the mailing server. For some reason, my replies to the
> previous discussion thread could not be seen by others. After numerous
> attempts, Colin suggested I start a new thread.
> Original Discussion Thread:
> https://sematext.com/opensee/m/Kafka/uyzND1Yl7Er128CQu1?subj=+DISCUSS+KIP+455+Create+an+Administrative+API+for+Replica+Reassignment
> KIP:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment
> Last Reply of Previous Thread:
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201906.mbox/%3C679a4c5b-3da6-4556-bb89-e680d8cbb705%40www.fastmail.com%3E
> The following is my reply:
> ----
> Hi again,
> This has been a great discussion on a tricky KIP. I appreciate everybody's
> involvement in improving this crucial API.
> That being said, I wanted to apologize for my first comment; it was a bit
> rushed and not thought out.
> I've got a few questions now that I have dug into this more deeply:
> 1. Does it make sense to have an easy way to cancel all ongoing
> reassignments? To cancel all ongoing reassignments, users had the crude
> option of deleting the znode, bouncing the controller and running the
> rollback JSON assignment that kafka-reassign-partitions.sh gave them
> (KAFKA-6304).
> Now that we support multiple reassignment requests, users may execute
> them incrementally. Suppose something goes horribly wrong and they want to
> revert as quickly as possible - they would need to run the tool with
> multiple rollback JSONs.  I think that it would be useful to have an easy
> way to stop all ongoing reassignments for emergency situations.
> ---------
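To make the "multiple rollback JSONs" problem in point 1 concrete, here is a minimal sketch of folding several rollback plans into one. It assumes the usual `{"version": 1, "partitions": [...]}` layout that kafka-reassign-partitions.sh prints. The `merge_rollbacks` helper is hypothetical, and so is the rule that the oldest entry for a partition wins (on the assumption that it records the assignment from before any move started).

```python
import json


def merge_rollbacks(rollback_jsons):
    """Merge rollback plans, given oldest first, into a single plan.

    Assumption: if a partition appears in several plans, the oldest entry
    describes the state before any reassignment began, so it wins.
    """
    merged = {}
    for raw in rollback_jsons:
        plan = json.loads(raw)
        for entry in plan["partitions"]:
            key = (entry["topic"], entry["partition"])
            # setdefault keeps the first (oldest) replica list seen for a key.
            merged.setdefault(key, entry["replicas"])
    partitions = [
        {"topic": topic, "partition": partition, "replicas": replicas}
        for (topic, partition), replicas in sorted(merged.items())
    ]
    return json.dumps({"version": 1, "partitions": partitions})
```

This only shows the bookkeeping a user has to do by hand today; a server-side "cancel everything" call would make it unnecessary.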
> 2. Our kafka-reassign-partitions.sh tool doesn't seem to currently let you
> figure out the ongoing assignments - I guess we expect people to use
> kafka-topics.sh for that. I am not sure how well that would continue to
> work now that we update the replica set only after the new replica joins
> the ISR.
> Do you think it makes sense to add an option for listing the current
> reassignments to the reassign tool as part of this KIP?
> We might want to think about whether we want to show the TargetReplicas
> information in the kafka-topics command for completeness as well. That
> might involve the need to update the DescribeTopicsResponse. Personally I
> can't see a downside but I haven't given it too much thought. I fully agree
> that we don't want to add the target replicas to the full replica set and
> nothing useful comes out of telling users they have a replica that might
> not have copied a single byte. Yet, telling them that we have the intention
> of copying bytes sounds useful so maybe having a separate column in
> kafka-topics.sh would provide better clarity?
> ---------
> 3. What happens if we do another reassignment to a partition while one is
> in progress? Do we overwrite the TargetReplicas?
> In the example sequence you gave:
> R: [1, 2, 3, 4, 5, 6], I: [1, 2, 3, 4, 5, 6], T: [4, 5, 6]
> What would the behavior be if a new reassign request came with
> TargetReplicas of [7, 8, 9] for that partition?
> To avoid complexity and potential race conditions, would it make sense to
> reject a reassign request once one is in progress for the specific
> partition, essentially forcing the user to cancel it first?
> Forcing the user to cancel has the benefit of being explicit and guarding
> against human mistakes. The downside I can think of is that in some
> scenarios it might be inefficient, e.g.
> R: [1, 2, 3, 4, 5, 6], I: [1, 2, 3, 4, 5, 6], T: [4, 5, 6]
> Cancel request sent out, followed by a new reassign request with
> TargetReplicas of [5, 6, 7] (note that 5 and 6 already fully copied the
> partition). It becomes a bit of a race condition of whether we deleted the
> partitions in between requests or not - I assume in practice this won't be
> an issue. I still feel like I prefer the explicit cancellation step.
> ---------
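The "reject until cancelled" option in point 3 can be sketched as a small in-memory model. This is only an illustration of the proposed semantics, not controller code. `ReassignmentTracker` and `ReassignmentInProgress` are hypothetical names.

```python
class ReassignmentInProgress(Exception):
    """Raised when a partition already has a pending reassignment."""


class ReassignmentTracker:
    """Toy model: at most one pending reassignment per partition."""

    def __init__(self):
        self._targets = {}  # (topic, partition) -> target replica list

    def submit(self, topic, partition, target_replicas):
        key = (topic, partition)
        if key in self._targets:
            # Explicit rejection instead of silently overwriting the target.
            raise ReassignmentInProgress(
                f"{topic}-{partition} is already moving to {self._targets[key]}"
            )
        self._targets[key] = list(target_replicas)

    def cancel(self, topic, partition):
        """Drop the pending reassignment; returns the old target or None."""
        return self._targets.pop((topic, partition), None)
```

With this model, submitting [7, 8, 9] while T: [4, 5, 6] is pending fails, and the caller has to cancel first and then submit the new target.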
> 4. My biggest concern - I want to better touch on the interaction between
> the new API and the current /admin/reassign_partitions znode, the
> compatibility and our strategy there.
> The KIP says:
>> For compatibility purposes, we will continue to allow assignments to be
>> submitted through the /admin/reassign_partitions node. Just as with the
>> current code, this will only be possible if there are no current
>> assignments. In other words, the znode has two states: empty and waiting
>> for a write, and non-empty because there are assignments in progress. Once
>> the znode is non-empty, further writes to it will be ignored.
> Given the current proposal, I can think of 5 scenarios I want to get a
> better understanding of:
> *(i, ii, iii, iiii talk about the reassignment of the same one partition
> only - partitionA)*
> i. znode is empty, new reassignment triggered via API, znode is updated
> When the new reassignment is triggered via the API, do we create the znode
> or do we allow a separate tool to trigger another reassignment through it?
> ii. (assuming we allow creating the znode as with scenario "i"): znode is
> empty, new reassignment triggered via API, znode is updated, znode is deleted
> My understanding is that deleting the znode does not do anything until the
> Controller is bounced - is that correct?
> If so, this means that nothing will happen. If the Controller is bounced,
> the reassignment state will still be live in the [partitionId]/state znode
> iii. znode is updated, new reassignment triggered via API
> We override the reassignment for partitionA. The reassign_partitions znode
> is showing stale data, correct?
> iiii. znode is updated, new reassignment triggered via API, controller
> failover
> What does the controller believe - the [partitionId]/state znode or the
> /reassign_partitions? I would assume the [partitionId]/state znode since
> in this case we want the reassignment API call to be the correct one. I
> think that opens up the possibility of missing a freshly-set
> /reassign_partitions though (e.g. if it was empty and was set right during
> controller failover).
> iiiii. znode is updated to move partitionA, new reassignment triggered via
> API for partitionB, partitionA move finishes
> At this point, do we delete the znode or do we wait until the partitionB
> move finishes as well?
> From the discussion here:
>> There's no guarantee that what is in the znode reflects the current
>> reassignments that are going on.  The only thing you can know is that if
>> the znode exists, there is at least one reassignment going on.
> This is changing the expected behavior of a tool that obeys Kafka's current
> behavior though. It is true that updating the znode while a reassignment is
> in progress has no effect other than making ZK misleading, but tools might
> have grown to follow that rule and only update the znode once it is empty.
> I think we might want to be more explicit when making such changes - I had
> seen discontentment in the community from the fact that we had changed the
> znode updating behavior in a MINOR pull request.
> I feel it is complex to support both APIs and make sure we don't have
> unhandled edge cases. I liked Bob's suggestion on potentially allowing only
> one via a feature flag:
>> Could we temporarily support
>> both, with a config enabling the new behavior to prevent users from trying
>> to use both mechanisms (if the config is true, the old znode is ignored; if
>> the config is false, the Admin Client API returns an error indicating that
>> it is not enabled)?
> Perhaps it makes sense to discuss that possibility a bit more?
> ---------
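The feature-flag idea quoted in point 4 boils down to a small decision table. A rough sketch follows, assuming a single boolean config. The `api_enabled` flag, `handle_reassignment`, and `ApiNotEnabledError` are hypothetical names, not anything the KIP defines.

```python
class ApiNotEnabledError(Exception):
    """Error handed back to Admin API callers when the flag is off."""


def handle_reassignment(source, api_enabled):
    """Return "accepted" or "ignored" for a request from "api" or "znode"."""
    if source == "api":
        if not api_enabled:
            # Flag off: the Admin Client API reports that it is not enabled.
            raise ApiNotEnabledError("reassignment API is not enabled")
        return "accepted"
    if source == "znode":
        # Flag on: the legacy znode is ignored, so only one mechanism is live.
        return "ignored" if api_enabled else "accepted"
    raise ValueError(f"unknown source: {source!r}")
```

Only one of the two mechanisms ever takes effect for a given flag value, which is what removes the mixed-state scenarios listed above.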
> 5. ListPartitionReassignments filtering
>> I guess the thought process here is that most reassignment tools want to
>> know about all the reassignments that are going on.  If you don't know all
>> the pending reassignments, then it's hard to say whether adding a new one
>> is a good idea, or cancelling an existing one.  So I guess I can't think of
>> a case where a reassignment tool would want a partial set rather than the
>> full one.
> I agree with Jason about the UIs having "drill into" options. I believe we
> should support some sort of ongoing reassignment filtering at the topic
> level (that's the administrative concept people care about).
> An example of a tool that might leverage it is our own
> kafka-reassign-partitions.sh. You can ask that tool to generate a
> reassignment for you from a given list of topics. It currently uses
> `KafkaZkClient#getReplicaAssignmentForTopics()` to get the current
> assignment for the given topics. It would be better if it could use the new
> ListPartitionReassignments API to both figure out the current replica
> assignments and whether or not those topics are being reassigned (it could
> log a warning that a reassignment is in progress for those topics).
> ---------
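A minimal sketch of the topic-level filtering from point 5, and of the warning the reassign tool could log. It models the listing result as a plain dict. `reassignments_for_topics` and `warn_if_reassigning` are hypothetical helpers, and the dict shape is an assumption rather than the actual response schema.

```python
def reassignments_for_topics(reassignments, topics=None):
    """Filter {(topic, partition): target_replicas} by topic.

    topics=None means "no filter", i.e. return every ongoing reassignment.
    """
    if topics is None:
        return dict(reassignments)
    wanted = set(topics)
    return {tp: target for tp, target in reassignments.items() if tp[0] in wanted}


def warn_if_reassigning(reassignments, topics):
    """Build warning lines for requested topics that are mid-reassignment."""
    busy = reassignments_for_topics(reassignments, topics)
    return [
        f"WARNING: {topic}-{partition} is being reassigned to {target}"
        for (topic, partition), target in sorted(busy.items())
    ]
```

A tool generating a plan for a list of topics could call `warn_if_reassigning` first and surface the result before proposing new moves.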
> And a small nit: we also need to update
> the ListPartitionReassignmentsResponse with the decided
> current/targetReplicas naming.
> Thanks,
> Stanislav
