kafka-jira mailing list archives

From "Jason Gustafson (JIRA)" <j...@apache.org>
Subject [jira] [Created] (KAFKA-6361) Fast leader fail over can lead to log divergence between replica and follower
Date Thu, 14 Dec 2017 00:46:20 GMT
Jason Gustafson created KAFKA-6361:
--------------------------------------

             Summary: Fast leader fail over can lead to log divergence between replica and follower
                 Key: KAFKA-6361
                 URL: https://issues.apache.org/jira/browse/KAFKA-6361
             Project: Kafka
          Issue Type: Bug
            Reporter: Jason Gustafson
            Assignee: Jason Gustafson


We have observed an edge case in the replication failover logic which can cause a replica
to permanently fall out of sync with the leader or, in the worst case, to end up with localized
divergence between the logs. This occurs in spite of the improved truncation logic from KIP-101.


Suppose we have brokers A and B. Initially A is the leader in epoch 1. It appends two batches:
one in the range (0, 10) and the other in the range (11, 20). The first one successfully replicates
to B, but the second one does not. In other words, the logs on the brokers look like this:

{code}
Broker A:
0: offsets [0, 10], leader epoch: 1
1: offsets [11, 20], leader epoch: 1

Broker B:
0: offsets [0, 10], leader epoch: 1
{code}

Broker A then has a zk session expiration and broker B is elected with epoch 2. It appends
a new batch with offsets (11, n) to its local log. So we now have this:

{code}
Broker A:
0: offsets [0, 10], leader epoch: 1
1: offsets [11, 20], leader epoch: 1

Broker B:
0: offsets [0, 10], leader epoch: 1
1: offsets: [11, n], leader epoch: 2
{code}

Normally we expect broker A to truncate to offset 11 on becoming the follower, but before
it is able to do so, broker B has its own zk session expiration and broker A again becomes
leader, now with epoch 3. It then appends a new batch in the range (21, 30). The updated logs
look like this:

{code}
Broker A:
0: offsets [0, 10], leader epoch: 1
1: offsets [11, 20], leader epoch: 1
2: offsets: [21, 30], leader epoch: 3

Broker B:
0: offsets [0, 10], leader epoch: 1
1: offsets: [11, n], leader epoch: 2
{code}

Now what happens next depends on the last offset of the batch appended in epoch 2. On becoming
follower, broker B will send an OffsetForLeaderEpoch request to broker A with epoch 2. Broker
A will respond that epoch 2 ends at offset 21. There are three cases (a rough sketch of this decision follows the list):

1) n < 20: In this case, broker B will not do any truncation. It will begin fetching from
offset n, which will ultimately cause an out-of-order offset error because broker A will return
the full batch beginning from offset 11, which broker B will be unable to append.

2) n == 20: Again broker B does not truncate. It will fetch from offset 21 and everything
will appear fine, though the logs have actually diverged.

3) n > 20: Broker B will attempt to truncate to offset 21. Since this is in the middle
of the batch, it will truncate all the way to offset 10. It can begin fetching from offset
11 and everything is fine.
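
To make the three cases concrete, here's a rough, self-contained sketch of the decision (not the actual broker code). It assumes broker A's epoch cache holds {1 -> 0, 3 -> 21} at this point, that a request for an epoch the leader hasn't seen resolves to the start offset of the next higher cached epoch, and that the follower only truncates when the returned end offset is below its own log end offset; the class and method names are made up for illustration:

{code}
import java.util.TreeMap;

// Hypothetical model of the exchange described above; names and rules are assumptions,
// not the real broker code paths.
public class EpochDivergenceSketch {

    // Leader side (broker A): answer an OffsetForLeaderEpoch request. An unknown epoch
    // resolves to the start offset of the next higher epoch in the cache.
    static long leaderEndOffsetFor(TreeMap<Integer, Long> epochStartOffsets, int requestedEpoch) {
        Integer nextEpoch = epochStartOffsets.higherKey(requestedEpoch);
        if (nextEpoch == null)
            return Long.MAX_VALUE; // stand-in for the leader's log end offset
        return epochStartOffsets.get(nextEpoch);
    }

    public static void main(String[] args) {
        // Broker A's epoch cache after becoming leader in epoch 3: epoch 1 starts at 0, epoch 3 at 21.
        TreeMap<Integer, Long> brokerA = new TreeMap<>();
        brokerA.put(1, 0L);
        brokerA.put(3, 21L);

        // Broker B asks about its latest epoch, 2, and gets back 21.
        long leaderEndOffset = leaderEndOffsetFor(brokerA, 2);

        // Follower side (broker B): its epoch-2 batch covers [11, n].
        for (long n : new long[]{15L, 20L, 25L}) {
            long followerLogEnd = n + 1; // first offset broker B does not yet have
            if (leaderEndOffset >= followerLogEnd) {
                // Cases 1 and 2: no truncation. With n < 20 the next fetch returns the full
                // [11, 20] batch and broker B hits an out-of-order offset error; with n == 20
                // the logs silently diverge.
                System.out.printf("n=%d: no truncation (leader end %d >= follower end %d)%n",
                        n, leaderEndOffset, followerLogEnd);
            } else {
                // Case 3: truncating to offset 21 lands in the middle of the [11, n] batch,
                // so the whole batch is removed and broker B refetches from offset 11.
                System.out.printf("n=%d: truncate to %d, mid-batch, so back to offset 10%n",
                        n, leaderEndOffset);
            }
        }
    }
}
{code}

For n=15 and n=20 this prints no truncation (cases 1 and 2), and for n=25 it truncates mid-batch back to offset 10 (case 3).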

The case we have actually seen is the first one. The second would likely go unnoticed
in practice, and everything is fine in the third case. To work around the issue, we deleted
the active segment on the replica, which allowed it to re-replicate consistently from the leader.

I'm not sure of the best solution for this scenario. Maybe if the leader isn't aware of an epoch,
it should always respond with {{UNDEFINED_EPOCH_OFFSET}} instead of using the offset of the
next highest epoch. That would cause the follower to truncate using its high watermark. Or
perhaps, instead of doing so, the follower could send another OffsetForLeaderEpoch request
for the next-lowest epoch in its cache and then truncate using that.
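
For the first of those ideas, here's a minimal sketch of what I have in mind (an illustration, not a patch): only answer for epochs this leader has actually seen, otherwise return {{UNDEFINED_EPOCH_OFFSET}}. The class name and the constant value used here are just for illustration:

{code}
import java.util.TreeMap;

// Hypothetical sketch of the first option above, not a change against the real code path:
// only answer for epochs this leader has actually seen; otherwise return
// UNDEFINED_EPOCH_OFFSET so the follower falls back to truncating at its high watermark.
class StrictEpochLookup {
    static final long UNDEFINED_EPOCH_OFFSET = -1L; // sentinel value for illustration

    static long endOffsetFor(TreeMap<Integer, Long> epochStartOffsets, int requestedEpoch) {
        if (!epochStartOffsets.containsKey(requestedEpoch))
            return UNDEFINED_EPOCH_OFFSET; // this leader never saw the requested epoch
        Integer nextEpoch = epochStartOffsets.higherKey(requestedEpoch);
        return nextEpoch == null
                ? Long.MAX_VALUE // stand-in for the leader's log end offset
                : epochStartOffsets.get(nextEpoch);
    }
}
{code}

In the scenario above, broker A's cache only contains epochs 1 and 3, so broker B's epoch 2 request would get {{UNDEFINED_EPOCH_OFFSET}} back and it would truncate using its high watermark rather than to offset 21.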




