kafka-jira mailing list archives

From "Jason Gustafson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-6361) Fast leader fail over can lead to log divergence between leader and follower
Date Thu, 14 Dec 2017 03:05:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290268#comment-16290268 ]

Jason Gustafson commented on KAFKA-6361:
----------------------------------------

Unclean leader election was disabled. It may not have been a session expiration that caused
B to become leader (I assumed so, but the logs don't make it clear and I haven't seen the
controller logs yet). In any case, when broker B took over, broker A was still in the ISR.
Broker B appended the entry as described above and then attempted to shrink the ISR, but it
failed to do so because its cached zk version was invalid. Broker A had already become leader
at that point.
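
For readers unfamiliar with why the ISR shrink failed, here is a minimal sketch of a
version-checked write. It is a toy model, not Kafka or ZooKeeper code: the class
VersionedNode, its conditional_update method, and the ISR contents written by the
controller are all assumptions made for illustration.

```python
class VersionedNode:
    """Toy stand-in for a znode holding partition state (hypothetical)."""

    def __init__(self, leader, epoch, isr):
        self.version = 0
        self.state = {"leader": leader, "epoch": epoch, "isr": list(isr)}

    def conditional_update(self, new_state, expected_version):
        # Compare-and-set: the write succeeds only if the caller's cached
        # version matches the node's current version.
        if expected_version != self.version:
            return False
        self.state = new_state
        self.version += 1
        return True


# Broker B becomes leader in epoch 2 and caches the node version.
node = VersionedNode(leader="B", epoch=2, isr=["A", "B"])
cached_version_on_b = node.version

# Leadership moves back to A (epoch 3), which bumps the node version.
# The ISR written here is an assumption for this sketch.
controller_ok = node.conditional_update(
    {"leader": "A", "epoch": 3, "isr": ["A"]}, node.version
)

# B still believes it is leader and tries to shrink the ISR using its
# stale cached version, so the write is rejected.
shrink_ok = node.conditional_update(
    {"leader": "B", "epoch": 2, "isr": ["B"]}, cached_version_on_b
)
```

In this model the controller's write succeeds and B's later write is rejected, which is
the ordering described above: B had already appended locally before it learned that it
was no longer leader.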

> Fast leader fail over can lead to log divergence between leader and follower
> ----------------------------------------------------------------------------
>
>                 Key: KAFKA-6361
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6361
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>
> We have observed an edge case in the replication failover logic which can cause a replica
to permanently fall out of sync with the leader or, in the worst case, actually have localized
divergence between logs. This occurs in spite of the improved truncation logic from KIP-101.

> Suppose we have brokers A and B. Initially A is the leader in epoch 1. It appends two
batches: one with offsets [0, 10] and the other with offsets [11, 20]. The first one successfully
replicates to B, but the second one does not. In other words, the logs on the brokers look
like this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> {code}
> Broker A then has a zk session expiration and broker B is elected with epoch 2. It appends
a new batch with offsets [11, n] to its local log. So we now have this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, n], leader epoch: 2
> {code}
> Normally we expect broker A to truncate to offset 11 on becoming the follower, but before
it is able to do so, broker B has its own zk session expiration and broker A again becomes
leader, now with epoch 3. It then appends a new batch with offsets [21, 30]. The updated logs
look like this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> 2: offsets [21, 30], leader epoch: 3
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, n], leader epoch: 2
> {code}
> Now what happens next depends on the last offset of the batch appended in epoch 2. On
becoming follower, broker B will send an OffsetForLeaderEpoch request to broker A with epoch
2. Broker A will respond that epoch 2 ends at offset 21. There are three cases:
> 1) n < 20: In this case, broker B will not do any truncation. It will begin fetching
from offset n + 1, which will ultimately cause an out of order offset error because broker A
will return the full batch beginning from offset 11, which broker B will be unable to append.
> 2) n == 20: Again broker B does not truncate. It will fetch from offset 21 and everything
will appear fine though the logs have actually diverged.
> 3) n > 20: Broker B will attempt to truncate to offset 21. Since this is in the middle
of the batch, it will truncate all the way to offset 10. It can begin fetching from offset
11 and everything is fine.
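
The three cases can be reproduced with a small simulation. This is a sketch under
simplifying assumptions, not broker code: Batch, end_offset_for_epoch, truncate_to,
follower_catch_up and broker_b are hypothetical names, truncation is modelled at
whole-batch granularity, and an append is rejected whenever a batch starts below the
follower's log end offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Batch:
    base: int   # first offset in the batch
    last: int   # last offset in the batch
    epoch: int  # leader epoch that wrote the batch


def log_end_offset(log):
    return log[-1].last + 1 if log else 0


def end_offset_for_epoch(leader_log, requested_epoch):
    # Start offset of the first batch with a larger epoch, else the log end.
    # For an epoch the leader never saw, this returns the start of the next
    # highest epoch, which is the behaviour described in the issue.
    for batch in leader_log:
        if batch.epoch > requested_epoch:
            return batch.base
    return log_end_offset(leader_log)


def truncate_to(log, offset):
    # Batch-granular truncation: drop every batch holding an offset >= offset.
    return [b for b in log if b.last < offset]


def follower_catch_up(leader_log, follower_log):
    end_offset = end_offset_for_epoch(leader_log, follower_log[-1].epoch)
    if end_offset < log_end_offset(follower_log):
        follower_log = truncate_to(follower_log, end_offset)
    fetch_offset = log_end_offset(follower_log)
    for batch in leader_log:
        if batch.last < fetch_offset:
            continue  # entirely before the fetch position
        if batch.base < fetch_offset:
            # The leader returns the whole batch; it overlaps the local log.
            return "out of order offset error", follower_log
        follower_log = follower_log + [batch]
        fetch_offset = batch.last + 1
    status = "consistent" if follower_log == leader_log else "diverged silently"
    return status, follower_log


LEADER_A = [Batch(0, 10, 1), Batch(11, 20, 1), Batch(21, 30, 3)]


def broker_b(n):
    return [Batch(0, 10, 1), Batch(11, n, 2)]


# One representative value of n for each of the three cases.
outcomes = {n: follower_catch_up(LEADER_A, broker_b(n))[0] for n in (15, 20, 25)}
```

With these inputs, outcomes maps n = 15 to the out of order offset error, n = 20 to
silent divergence, and n = 25 to a consistent log after truncation.
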
> The case we have actually seen is the first one. The second one would likely go unnoticed
in practice and everything is fine in the third case. To work around the issue, we deleted
the active segment on the replica, which allowed it to re-replicate consistently from the leader.
> I'm not sure what the best solution is for this scenario. Maybe if the leader isn't aware of
an epoch, it should always respond with {{UNDEFINED_EPOCH_OFFSET}} instead of using the offset
of the next highest epoch. That would cause the follower to truncate using its high watermark.
Or perhaps instead of doing so, it could send another OffsetForLeaderEpoch request at the
next previous cached epoch and then truncate using that.
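
The first suggestion could look roughly like the sketch below. It is only an
illustration of the idea, not a proposed patch: the function names, the sentinel value
of -1, and the follower high watermark of 11 are assumptions chosen to match the
example logs.

```python
UNDEFINED_EPOCH_OFFSET = -1  # sentinel value assumed for this sketch


def end_offset_strict(leader_epoch_starts, log_end_offset, requested_epoch):
    """leader_epoch_starts: sorted (epoch, start_offset) pairs known to the leader."""
    known_epochs = [epoch for epoch, _ in leader_epoch_starts]
    if requested_epoch not in known_epochs:
        # The leader never saw this epoch, so it declines to answer rather
        # than substituting the start of the next highest epoch.
        return UNDEFINED_EPOCH_OFFSET
    for epoch, start in leader_epoch_starts:
        if epoch > requested_epoch:
            return start
    return log_end_offset


def follower_truncation_point(end_offset, follower_log_end, high_watermark):
    if end_offset == UNDEFINED_EPOCH_OFFSET:
        # Fall back to the high watermark, as suggested above.
        return min(high_watermark, follower_log_end)
    return min(end_offset, follower_log_end)


# Broker A knows epoch 1 (starting at offset 0) and epoch 3 (starting at 21).
leader_a_epochs = [(1, 0), (3, 21)]
reply = end_offset_strict(leader_a_epochs, log_end_offset=31, requested_epoch=2)

# Broker B in the n == 20 case: log end offset 21, high watermark assumed 11.
truncate_at = follower_truncation_point(reply, follower_log_end=21, high_watermark=11)
```

Under these assumptions broker B would truncate to offset 11 and discard the batch
written in epoch 2, so the n == 20 case would no longer diverge silently.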



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
