kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mathieu Filotto (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (KAFKA-2143) Replicas get ahead of leader and fail
Date Fri, 10 Feb 2017 05:02:41 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860712#comment-15860712
] 

Mathieu Filotto edited comment on KAFKA-2143 at 2/10/17 5:02 AM:
-----------------------------------------------------------------

Is the upgrade the only option ?
Is 0.10 upgrade working too ?


was (Author: mathieu.filotto):
Is the upgrade the only option ?
Is 0.10 upgrade is working too ?

> Replicas get ahead of leader and fail
> -------------------------------------
>
>                 Key: KAFKA-2143
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2143
>             Project: Kafka
>          Issue Type: Bug
>          Components: replication
>    Affects Versions: 0.8.2.1
>            Reporter: Evan Huus
>            Assignee: Jiangjie Qin
>             Fix For: 0.9.0.1
>
>
> On a cluster of 6 nodes, we recently saw a case where a single under-replicated partition
suddenly appeared, replication lag spiked, and network IO spiked. The cluster appeared to
recover eventually on its own,
> Looking at the logs, the thing which failed was partition 7 of the topic {{background_queue}}.
It had an ISR of 1,4,3 and its leader at the time was 3. Here are the interesting log lines:
> On node 3 (the leader):
> {noformat}
> [2015-04-23 16:50:05,879] ERROR [Replica Manager on Broker 3]: Error when processing
fetch request for partition [background_queue,7] offset 3722949957 from follower with correlation
id 148185816. Possible cause: Request for offset 3722949957 but we only have log segments
in the range 3648049863 to 3722949955. (kafka.server.ReplicaManager)
> [2015-04-23 16:50:05,879] ERROR [Replica Manager on Broker 3]: Error when processing
fetch request for partition [background_queue,7] offset 3722949957 from follower with correlation
id 156007054. Possible cause: Request for offset 3722949957 but we only have log segments
in the range 3648049863 to 3722949955. (kafka.server.ReplicaManager)
> [2015-04-23 16:50:13,960] INFO Partition [background_queue,7] on broker 3: Shrinking
ISR for partition [background_queue,7] from 1,4,3 to 3 (kafka.cluster.Partition)
> {noformat}
> Note that both replicas suddenly asked for an offset *ahead* of the available offsets.
> And on nodes 1 and 4 (the replicas) many occurrences of the following:
> {noformat}
> [2015-04-23 16:50:05,935] INFO Scheduling log segment 3648049863 for log background_queue-7
for deletion. (kafka.log.Log) (edited)
> {noformat}
> Based on my reading, this looks like the replicas somehow got *ahead* of the leader,
asked for an invalid offset, got confused, and re-replicated the entire topic from scratch
to recover (this matches our network graphs, which show 3 sending a bunch of data to 1 and
4).
> Taking a stab in the dark at the cause, there appears to be a race condition where replicas
can receive a new offset before the leader has committed it and is ready to replicate?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message