kafka-jira mailing list archives

From "David van Geest (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-5758) Reassigning a topic's partitions can adversely impact other topics
Date Tue, 22 Aug 2017 15:25:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136933#comment-16136933 ]

David van Geest commented on KAFKA-5758:

[~ijuma], thanks for the response!

I'm not sure I understand the distinction between my option 1 and your option 3. In both,
we're talking about returning partial results in the response to `FetchRequest`, along with
an error of some sort for the partition that is no longer being followed, right?
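
To make sure we mean the same thing by "partial results", here is a rough sketch in Python (not the broker's Scala code). Every name in it (`ToyLeader`, `PartitionResult`, `fetch_partial`, `NotAssignedReplicaError`) is made up for illustration and is not a Kafka API. It contrasts the all-or-nothing behaviour described in the issue with a handler that reports an error only for the partition that can no longer be followed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

TopicPartition = Tuple[str, int]


class NotAssignedReplicaError(Exception):
    """Hypothetical stand-in for kafka.common.NotAssignedReplicaException."""


@dataclass
class PartitionResult:
    records: List[bytes]
    error: Optional[str] = None  # None means the partition was served normally


class ToyLeader:
    """Toy model of a leader broker; not how ReplicaManager is structured."""

    def __init__(self,
                 logs: Dict[TopicPartition, List[bytes]],
                 assigned: Dict[TopicPartition, Set[int]]) -> None:
        self.logs = logs
        self.assigned = assigned

    def read_for_follower(self, tp: TopicPartition, replica_id: int) -> List[bytes]:
        # Mirrors step 3 of the issue: the read is tied to the follower
        # being one of the assigned replicas.
        if replica_id not in self.assigned[tp]:
            raise NotAssignedReplicaError(
                f"replica {replica_id} is not assigned to {tp}")
        return self.logs[tp]

    def fetch_all_or_nothing(self, replica_id: int,
                             partitions: List[TopicPartition]) -> Dict[TopicPartition, PartitionResult]:
        # One bad partition raises, so no partition in the request is served.
        return {tp: PartitionResult(self.read_for_follower(tp, replica_id))
                for tp in partitions}

    def fetch_partial(self, replica_id: int,
                      partitions: List[TopicPartition]) -> Dict[TopicPartition, PartitionResult]:
        # The failure is confined to the affected partition; the rest are served.
        results: Dict[TopicPartition, PartitionResult] = {}
        for tp in partitions:
            try:
                results[tp] = PartitionResult(self.read_for_follower(tp, replica_id))
            except NotAssignedReplicaError as exc:
                results[tp] = PartitionResult([], error=str(exc))
        return results


# Broker 1006 was just removed from topic_being_reassigned-5 but still follows other_topic-0.
leader = ToyLeader(
    logs={("topic_being_reassigned", 5): [b"a"], ("other_topic", 0): [b"b", b"c"]},
    assigned={("topic_being_reassigned", 5): {1001, 1004, 1005},
              ("other_topic", 0): {1001, 1006}},
)
```

With `fetch_partial`, broker 1006 keeps receiving data for `other_topic` while the reassigned partition carries its own error, which is the outcome I understand both options to be aiming for.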

> Reassigning a topic's partitions can adversely impact other topics
> ------------------------------------------------------------------
>                 Key: KAFKA-5758
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5758
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions:
>            Reporter: David van Geest
>              Labels: reliability
>             Fix For: 1.0.0
> We've noticed that reassigning a topic's partitions seems to adversely impact other topics. Specifically, followers for other topics fall out of the ISR.
> While I'm not 100% sure about why this happens, the scenario seems to be as follows:
> 1. Reassignment is manually triggered on topic-partition X-Y, and broker A (which used to be a follower for X-Y) is no longer a follower.
> 2. Broker A makes a `FetchRequest` including topic-partition X-Y to broker B, just after the reassignment.
> 3. Broker B can fulfill the `FetchRequest`, but while trying to do so it tries to record the position of "follower" A. This fails, because broker A is no longer a follower for X-Y (see exception below).
> 4. The entire `FetchRequest` fails, and broker A's other followed topics start falling behind.
> 5. Depending on the length of the reassignment, this sequence repeats.
> In step 3, we see exceptions like:
> {noformat}
> Error when handling request Name: FetchRequest; Version: 3; CorrelationId: 46781859; ClientId: ReplicaFetcherThread-0-1001; ReplicaId: 1006; MaxWait: 500 ms; MinBytes: 1 bytes; MaxBytes:10485760 bytes; RequestInfo: 
> kafka.common.NotAssignedReplicaException: Leader 1001 failed to record follower 1006's position -1 since the replica is not recognized to be one of the assigned replicas 1001,1004,1005 for partition [topic_being_reassigned,5].
> 	at kafka.cluster.Partition.updateReplicaLogReadResult(Partition.scala:249)
> 	at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:923)
> 	at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:920)
> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> 	at kafka.server.ReplicaManager.updateFollowerLogReadResults(ReplicaManager.scala:920)
> 	at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:481)
> 	at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:534)
> 	at kafka.server.KafkaApis.handle(KafkaApis.scala:79)
> 	at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Does my assessment make sense? If so, this behaviour seems problematic. A few changes that might improve matters (assuming I'm on the right track):
> 1. `FetchRequest` should be able to return partial results
> 2. The broker fulfilling the `FetchRequest` could ignore the `NotAssignedReplicaException`, and return results without recording the not-any-longer-follower position.
> This behaviour was experienced with, although looking at the changelogs and the code in question, I don't see any reason why it would have changed in later versions.
> Am very interested to have some discussion on this. Thanks!
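
For reference, option 2 from the description above could look roughly like the following Python sketch. It is an illustration under assumptions, not the actual `Partition`/`ReplicaManager` logic: `ToyPartition`, `record_follower_position`, and `update_follower_positions` are hypothetical names. The idea is only that the leader skips the position update for a replica it no longer recognizes, rather than letting the exception abort the whole request.

```python
from typing import Dict, Iterable, List


class NotAssignedReplicaError(Exception):
    """Hypothetical stand-in for kafka.common.NotAssignedReplicaException."""


class ToyPartition:
    """Toy leader-side view of one partition; not Kafka's Partition class."""

    def __init__(self, name: str, assigned_replicas: Iterable[int]) -> None:
        self.name = name
        self.assigned_replicas = set(assigned_replicas)
        self.follower_offsets: Dict[int, int] = {}

    def record_follower_position(self, replica_id: int, offset: int) -> None:
        if replica_id not in self.assigned_replicas:
            raise NotAssignedReplicaError(
                f"replica {replica_id} is not assigned to {self.name}")
        self.follower_offsets[replica_id] = offset


def update_follower_positions(partitions: Dict[str, ToyPartition],
                              replica_id: int,
                              fetch_offsets: Dict[str, int]) -> List[str]:
    """Record the follower's position per partition; return the partitions skipped."""
    skipped: List[str] = []
    for name, offset in fetch_offsets.items():
        try:
            partitions[name].record_follower_position(replica_id, offset)
        except NotAssignedReplicaError:
            # Option 2: ignore the error for this partition and carry on,
            # so the remaining partitions still get their positions recorded.
            skipped.append(name)
    return skipped


partitions = {
    "topic_being_reassigned-5": ToyPartition("topic_being_reassigned-5", [1001, 1004, 1005]),
    "other_topic-0": ToyPartition("other_topic-0", [1001, 1006]),
}
skipped = update_follower_positions(
    partitions, 1006, {"topic_being_reassigned-5": 10, "other_topic-0": 42})
```

Here `skipped` contains only the reassigned partition, and the position of 1006 on `other_topic-0` is still recorded, so that follower would not drift out of the ISR because of an unrelated reassignment.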

This message was sent by Atlassian JIRA
