kafka-dev mailing list archives

From "Debowczyk (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-3410) Unclean leader election and "Halting because log truncation is not allowed"
Date Wed, 15 Jun 2016 13:11:09 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331709#comment-15331709 ]

Debowczyk commented on KAFKA-3410:
----------------------------------

Hi, are there plans to solve this issue? We had a similar problem in our test environment. I'm
not sure what caused it (probably a Kafka cluster restart). Setting unclean.leader.election.enable=true
didn't solve the problem. We had to remove the broken topics' data. It would be nice to have a
procedure to recover from such a situation if recovery can't be done automatically.

> Unclean leader election and "Halting because log truncation is not allowed"
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-3410
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3410
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: James Cheng
>
> I ran into a scenario where one of my brokers would continually shut down, with the error message:
> [2016-02-25 00:29:39,236] FATAL [ReplicaFetcherThread-0-1], Halting because log truncation is not allowed for topic test, Current leader 1's latest offset 0 is less than replica 2's latest offset 151 (kafka.server.ReplicaFetcherThread)
> I managed to reproduce it with the following scenario:
> 1. Start broker1, with unclean.leader.election.enable=false
> 2. Start broker2, with unclean.leader.election.enable=false
> 3. Create topic, single partition, with replication-factor 2.
> 4. Write data to the topic.
> 5. At this point, both brokers are in the ISR. Broker1 is the partition leader.
> 6. Ctrl-Z on broker2. (Simulates a GC pause or a slow network) Broker2 gets dropped out
>    of ISR. Broker1 is still the leader. I can still write data to the partition.
> 7. Shut down Broker1. Hard or controlled, doesn't matter.
> 8. rm -rf the log directory of broker1. (This simulates a disk replacement or full hardware
>    replacement)
> 9. Resume broker2. It attempts to connect to broker1, but doesn't succeed because broker1
>    is down. At this point, the partition is offline. Can't write to it.
> 10. Resume broker1. Broker1 resumes leadership of the topic. Broker2 attempts to join
>    ISR, and immediately halts with the error message:
> [2016-02-25 00:29:39,236] FATAL [ReplicaFetcherThread-0-1], Halting because log truncation is not allowed for topic test, Current leader 1's latest offset 0 is less than replica 2's latest offset 151 (kafka.server.ReplicaFetcherThread)
> I am able to recover by setting unclean.leader.election.enable=true on my brokers.
> I'm trying to understand a couple things:
> * In step 10, why is broker1 allowed to resume leadership even though it has no data?
> * In step 10, why is it necessary to stop the entire broker due to one partition that
>   is in this state? Wouldn't it be possible for the broker to continue to serve traffic for
>   all the other topics, and just mark this one as unavailable?
> * Would it make sense to allow an operator to manually specify which broker they want
>   to become the new master? This would give me more control over how much data loss I am willing
>   to handle. In this case, I would want broker2 to become the new master. Or, is that possible
>   and I just don't know how to do it?
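
For readers trying to follow step 10 of the quoted report, here is a rough model of the
decision the follower appears to make, inferred only from the FATAL log message and the
reporter's workaround. It is a simplified sketch, not Kafka's actual code: the names Replica,
HaltBroker and reconcile_follower are hypothetical, and the real broker halts the whole
process rather than raising an exception.

```python
from dataclasses import dataclass


class HaltBroker(Exception):
    """Stands in for the broker process halting (hypothetical name)."""


@dataclass
class Replica:
    broker_id: int
    log_end_offset: int


def reconcile_follower(leader: Replica, follower: Replica, topic: str,
                       unclean_election_enabled: bool) -> int:
    """Return the follower's log end offset after comparing it with the leader's."""
    if leader.log_end_offset >= follower.log_end_offset:
        # Normal case: the follower is level with or behind the leader and keeps fetching.
        return follower.log_end_offset
    if not unclean_election_enabled:
        # The follower holds messages the leader lacks and may not discard them.
        raise HaltBroker(
            f"Halting because log truncation is not allowed for topic {topic}, "
            f"Current leader {leader.broker_id}'s latest offset {leader.log_end_offset} "
            f"is less than replica {follower.broker_id}'s latest offset "
            f"{follower.log_end_offset}"
        )
    # Unclean election allowed: the follower drops its extra messages to match the leader.
    follower.log_end_offset = leader.log_end_offset
    return follower.log_end_offset
```

With the values from the report (leader broker 1 at offset 0, follower broker 2 at offset 151),
calling reconcile_follower with unclean_election_enabled=False raises HaltBroker, while passing
True truncates the follower to offset 0, which means the 151 messages held only by broker2 are
lost.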



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
