kafka-dev mailing list archives

From "Jason Gustafson (JIRA)" <j...@apache.org>
Subject [jira] [Created] (KAFKA-7408) Truncate to LSO on unclean leader election
Date Thu, 13 Sep 2018 20:44:00 GMT
Jason Gustafson created KAFKA-7408:
--------------------------------------

             Summary: Truncate to LSO on unclean leader election
                 Key: KAFKA-7408
                 URL: https://issues.apache.org/jira/browse/KAFKA-7408
             Project: Kafka
          Issue Type: Improvement
            Reporter: Jason Gustafson
            Assignee: Jason Gustafson


If an unclean leader is elected, we may lose committed transaction data. That alone is expected,
but what is worse is that a transaction which was previously completed (either committed or
aborted) may lose its marker and become dangling. The transaction coordinator will not know
about the unclean leader election, so it will not know to resend the transaction markers. Consumers
with read_committed isolation will be stuck because the LSO cannot advance.
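
To make the stuck-LSO problem concrete, here is a minimal sketch in Python. It is a toy model, not Kafka's implementation: the `Record` type, the `last_stable_offset` function, and the rule that every record is transactional are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Record:
    offset: int
    producer_id: int
    is_marker: bool = False  # True for a COMMIT/ABORT control record (toy model)


def last_stable_offset(log: List[Record], high_watermark: int) -> int:
    """First offset of the earliest still-open transaction, else the high watermark."""
    open_txn_start: Dict[int, int] = {}
    for rec in log:
        if rec.offset >= high_watermark:
            break
        if rec.is_marker:
            # A marker completes the producer's transaction.
            open_txn_start.pop(rec.producer_id, None)
        else:
            # Remember only the first offset of each open transaction.
            open_txn_start.setdefault(rec.producer_id, rec.offset)
    return min(open_txn_start.values(), default=high_watermark)


# Original leader: producer 7 wrote offsets 0-1 and its marker landed at offset 2.
healthy = [Record(0, 7), Record(1, 7), Record(2, 7, True),
           Record(3, 8), Record(4, 8, True)]

# Unclean leader: its log ended at offset 1, so producer 7's marker is gone.
# Producer 8 then completes a transaction on top of the shortened log.
dangling = [Record(0, 7), Record(1, 7), Record(2, 8), Record(3, 8, True)]

healthy_lso = last_stable_offset(healthy, high_watermark=5)
dangling_lso = last_stable_offset(dangling, high_watermark=4)
```

In this model `healthy_lso` reaches the high watermark, while `dangling_lso` stays pinned at producer 7's first offset no matter how many later transactions complete.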

To keep this scenario from occurring, it would be better to have the unclean leader truncate
to the LSO so that there are no dangling transactions. Truncating to the LSO alone is not
sufficient because the markers which allowed the LSO to advance may be at higher offsets, so
the truncation would remove them.
What we can do is let the newly elected leader truncate to the LSO and then rewrite all the
markers that followed it using its own leader epoch (to avoid divergence from followers).
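
The proposed truncate-and-rewrite step can be sketched as follows. This is a toy model under stated assumptions, not broker code: the `Record` type, the `truncate_to_lso` function, and the assumption that offsets are contiguous from 0 are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    offset: int
    producer_id: int
    leader_epoch: int
    marker: Optional[str] = None  # "COMMIT" or "ABORT" for control records, None for data


def truncate_to_lso(log: List[Record], lso: int, new_leader_epoch: int) -> List[Record]:
    """Drop everything at or above the LSO, then re-append the markers that followed it."""
    kept = [r for r in log if r.offset < lso]
    markers = [r for r in log if r.offset >= lso and r.marker is not None]
    next_offset = lso  # assumes offsets are contiguous, so the log now ends at lso - 1
    for m in markers:
        # Rewritten markers carry the new leader's epoch to avoid divergence from followers.
        kept.append(replace(m, offset=next_offset, leader_epoch=new_leader_epoch))
        next_offset += 1
    return kept


# Producer 1 committed at offset 2, but producer 2's open transaction pins the LSO at 1.
log = [
    Record(0, 1, 5),
    Record(1, 2, 5),
    Record(2, 1, 5, "COMMIT"),
    Record(3, 2, 5),
]
result = truncate_to_lso(log, lso=1, new_leader_epoch=6)
```

Here producer 1's transaction survives in whole, with its marker rewritten at offset 1 under epoch 6, and producer 2's incomplete transaction is removed in whole.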

The interesting cases when an unclean leader election occurs are when a transaction is
ongoing.

1. If a producer is in the middle of a transaction commit, then the coordinator may still
attempt to write transaction markers. This will either succeed or fail depending on the producer
epoch in the unclean leader. If the epoch matches, then the WriteTxnMarker call will succeed,
and the resulting marker will simply be ignored by the consumer. If the epoch doesn't match, the WriteTxnMarker
call will fail and the transaction coordinator can potentially remove the partition from the
transaction.

2. If a producer is still writing the transaction, then what happens depends on the producer
state in the unclean leader. If no producer state has been lost, then the transaction can
continue without impact. Otherwise, the producer will likely fail with an OUT_OF_ORDER_SEQUENCE
error, which will cause the transaction to be aborted by the coordinator. That takes us back
to the first case.
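
The two cases above can be summarized as a small decision sketch. It is a deliberate simplification: the function names, the `Outcome` values, and the rule that lost producer state shows up as a sequence gap are assumptions for illustration, not the broker's or coordinator's actual logic.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    MARKER_WRITTEN = "marker written; consumer ignores it"
    PARTITION_REMOVED = "marker rejected; coordinator may drop the partition"
    TRANSACTION_CONTINUES = "no producer state lost; transaction continues"


def on_marker_write(producer_epoch: int, leader_epoch: Optional[int]) -> Outcome:
    """Case 1: the coordinator sends WriteTxnMarker to the unclean leader."""
    if leader_epoch == producer_epoch:
        return Outcome.MARKER_WRITTEN
    return Outcome.PARTITION_REMOVED


def on_producer_write(next_sequence: int,
                      leader_last_sequence: Optional[int],
                      producer_epoch: int,
                      leader_epoch: Optional[int]) -> Outcome:
    """Case 2: the producer is still writing to the unclean leader."""
    state_intact = (leader_last_sequence is not None
                    and next_sequence == leader_last_sequence + 1)
    if state_intact:
        return Outcome.TRANSACTION_CONTINUES
    # Modeled as OUT_OF_ORDER_SEQUENCE: the coordinator aborts, which is case 1 again.
    return on_marker_write(producer_epoch, leader_epoch)


continues = on_producer_write(10, 9, producer_epoch=3, leader_epoch=3)
aborted = on_producer_write(10, 4, producer_epoch=3, leader_epoch=3)
state_gone = on_producer_write(10, None, producer_epoch=3, leader_epoch=None)
```

In all three sample calls the transaction either continues untouched or ends up on the marker path from the first case.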

By truncating to the LSO, we ensure that transactions are either preserved in whole or they are
removed from the log in whole. For an unclean leader election, that's probably as good as
we can do. But we are assured that consumers will not be blocked by dangling transactions.
The only remaining situation where a dangling transaction might be left is if one of the transaction
state partitions has an unclean leader election.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
