kafka-jira mailing list archives

From "David van Geest (JIRA)" <j...@apache.org>
Subject [jira] [Created] (KAFKA-5758) Reassigning a topic's partitions can adversely impact other topics
Date Mon, 21 Aug 2017 17:12:00 GMT
David van Geest created KAFKA-5758:
--------------------------------------

             Summary: Reassigning a topic's partitions can adversely impact other topics
                 Key: KAFKA-5758
                 URL: https://issues.apache.org/jira/browse/KAFKA-5758
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 0.10.1.1
            Reporter: David van Geest


We've noticed that reassigning a topic's partitions seems to adversely impact other topics.
Specifically, followers for other topics fall out of the ISR.

While I'm not 100% sure about why this happens, the scenario seems to be as follows:

1. Reassignment is manually triggered on topic-partition X-Y, and broker A (which used to
be a follower for X-Y) is no longer a follower.
2. Broker A sends a `FetchRequest` that includes topic-partition X-Y to broker B, just after the
reassignment.
3. Broker B could fulfill the `FetchRequest`, but while doing so it tries to record the
position of "follower" A. This fails, because broker A is no longer a follower for X-Y (see
exception below).
4. The entire `FetchRequest` fails, and the other topic-partitions that broker A follows start
falling behind.
5. Depending on the length of the reassignment, this sequence repeats.
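
To make the suspected sequence concrete, here is a toy model of steps 1-4 in Python. It is not Kafka's code: the function names, the `ASSIGNED_REPLICAS` table, and `other_topic` are hypothetical, and the broker ids are borrowed from the exception shown further down. It only illustrates how one unrecognized partition could fail a whole multi-partition fetch.

```python
class NotAssignedReplicaError(Exception):
    """Toy stand-in for kafka.common.NotAssignedReplicaException."""


# Hypothetical leader-side state: partition -> assigned replica broker ids.
ASSIGNED_REPLICAS = {
    ("topic_being_reassigned", 5): {1001, 1004, 1005},  # 1006 was reassigned away
    ("other_topic", 0): {1001, 1006},                   # 1006 is still a follower
}


def record_follower_position(partition, follower_id):
    """Raise if the fetching broker is not an assigned replica of the partition."""
    if follower_id not in ASSIGNED_REPLICAS[partition]:
        raise NotAssignedReplicaError(
            f"follower {follower_id} is not assigned to {partition}"
        )


def handle_fetch_all_or_nothing(follower_id, partitions):
    """Suspected behaviour: an error on any partition aborts the whole request."""
    response = {}
    for partition in partitions:
        record_follower_position(partition, follower_id)  # may raise
        response[partition] = "records"
    return response


def simulate_stale_fetch():
    """Broker 1006 still asks for the reassigned partition next to a healthy one."""
    requested = [("other_topic", 0), ("topic_being_reassigned", 5)]
    try:
        return handle_fetch_all_or_nothing(1006, requested)
    except NotAssignedReplicaError:
        # Nothing is returned, so other_topic makes no progress either.
        return {}
```

In this model `simulate_stale_fetch()` comes back empty even though `other_topic` could have been served, which is the effect described in step 4.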

In step 3, we see exceptions like:

{noformat}
Error when handling request Name: FetchRequest; Version: 3; CorrelationId: 46781859; ClientId:
ReplicaFetcherThread-0-1001; ReplicaId: 1006; MaxWait: 500 ms; MinBytes: 1 bytes; MaxBytes:10485760
bytes; RequestInfo: 

<LOTS OF PARTITIONS>

kafka.common.NotAssignedReplicaException: Leader 1001 failed to record follower 1006's position
-1 since the replica is not recognized to be one of the assigned replicas 1001,1004,1005 for
partition [topic_being_reassigned,5].
{noformat}

Does my assessment make sense? If so, this behaviour seems problematic. A few changes that
might improve matters (assuming I'm on the right track):

1. `FetchRequest` should be able to return partial results
2. The broker fulfilling the `FetchRequest` could ignore the `NotAssignedReplicaException`,
and return results without recording the former follower's position.
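
As a rough sketch of what these two changes could look like, here is a toy model in Python. Again this is not Kafka's code: `fetch_partial`, `fetch_tolerant`, the `"NOT_ASSIGNED_REPLICA"` label, and the data layout are all made up for illustration.

```python
def fetch_partial(follower_id, partitions, assigned_replicas):
    """Change 1: report errors per partition instead of failing the whole request."""
    results = {}
    for partition in partitions:
        if follower_id in assigned_replicas.get(partition, set()):
            results[partition] = {"error": None, "records": "records"}
        else:
            # Only this partition carries an error; the others are still served.
            results[partition] = {"error": "NOT_ASSIGNED_REPLICA", "records": None}
    return results


def fetch_tolerant(follower_id, partitions, assigned_replicas, positions):
    """Change 2: serve every partition, but record positions only for real followers."""
    served = {}
    for partition in partitions:
        if follower_id in assigned_replicas.get(partition, set()):
            positions[(partition, follower_id)] = "recorded"
        # A non-follower still gets data; its position is simply not tracked.
        served[partition] = "records"
    return served
```

With either variant, a stale fetch for the reassigned partition would no longer hold back the other partitions in the same request. Which one is preferable presumably depends on whether the fetching broker should be told that it is no longer a replica.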

This behaviour was experienced with 0.10.1.1, although looking at the changelogs and the code
in question, I don't see any reason why it would have changed in later versions.

I'm very interested in discussing this. Thanks!




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
