kafka-dev mailing list archives

From "Neha Narkhede (JIRA)" <j...@apache.org>
Subject [jira] [Work started] (KAFKA-1097) Race condition while reassigning low throughput partition leads to incorrect ISR information in zookeeper
Date Fri, 01 Nov 2013 17:27:22 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on KAFKA-1097 started by Neha Narkhede.

> Race condition while reassigning low throughput partition leads to incorrect ISR information in zookeeper
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1097
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1097
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8
>            Reporter: Neha Narkhede
>            Assignee: Neha Narkhede
>            Priority: Critical
>             Fix For: 0.8.1
>
>         Attachments: KAFKA-1097.patch, KAFKA-1097_2013-10-29_10:49:45.patch, KAFKA-1097_2013-10-30_21:46:00.patch, KAFKA-1097_2013-10-31_10:37:29.patch, KAFKA-1097_2013-11-01_09:55:33.patch
>
>
> While moving partitions, the controller moves the old replicas through the following
> state changes:
> ONLINE -> OFFLINE -> NON_EXISTENT
> During the OFFLINE state change, the controller removes the old replica from the ISR,
> writes the updated ISR to ZooKeeper, and notifies the leader. Note that it does not notify
> the old replicas to stop fetching from the leader (to be fixed in KAFKA-1032). During the
> NON_EXISTENT state change, the controller does not write the updated ISR or replica list
> to ZooKeeper. Right after the NON_EXISTENT state change, the controller writes the new
> replica list to ZooKeeper but does not update the ISR. So an old replica can send a fetch
> request after the OFFLINE state change, which lets the leader add it back to the ISR. The
> problem is that if no new data is coming in for the partition and the old replica is fully
> caught up, the leader cannot remove it from the ISR. That lets a non-existent replica live
> in the ISR at least until new data comes in to the partition.
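
The interleaving described in the quoted report can be sketched as a small simulation. This is a toy model, not Kafka's controller or broker code: `ToyLeader`, `simulate_reassignment`, the `zk` dictionary, the broker ids 1, 2, 3 and the offset 100 are all hypothetical names and values chosen for illustration. It also assumes, as a simplification, that the leader only shrinks the ISR when a follower's offset is behind the log end.

```python
from enum import Enum


class ReplicaState(Enum):
    ONLINE = 1
    OFFLINE = 2
    NON_EXISTENT = 3


class ToyLeader:
    """Hypothetical stand-in for the partition leader; not Kafka's real code."""

    def __init__(self, zk, log_end_offset):
        self.zk = zk                      # toy stand-in for ZooKeeper state
        self.log_end_offset = log_end_offset
        self.follower_offsets = {}

    def handle_fetch(self, replica_id, offset):
        # A follower that has reached the log end is (re)added to the ISR.
        # Assumption: the leader does not check the assigned replica list here.
        self.follower_offsets[replica_id] = offset
        if offset >= self.log_end_offset:
            self.zk["isr"] = self.zk["isr"] | {replica_id}

    def shrink_isr(self):
        # Simplified lag check: only followers behind the log end are removed.
        lagging = {
            r for r in self.zk["isr"]
            if self.follower_offsets.get(r, self.log_end_offset) < self.log_end_offset
        }
        self.zk["isr"] = self.zk["isr"] - lagging


def simulate_reassignment():
    # Moving the partition from brokers {1, 2} to {1, 3}; broker 3 has already
    # caught up, broker 2 is the old replica, broker 1 is the leader.
    zk = {"replicas": {1, 2, 3}, "isr": {1, 2, 3}}
    leader = ToyLeader(zk, log_end_offset=100)
    old_replica = 2
    state = ReplicaState.ONLINE

    # OFFLINE: controller drops the old replica from the ISR and writes it out.
    state = ReplicaState.OFFLINE
    zk["isr"] = zk["isr"] - {old_replica}

    # The old replica was never told to stop fetching, so its next fetch lands
    # here; it is fully caught up, so the leader adds it back to the ISR.
    leader.handle_fetch(old_replica, 100)

    # NON_EXISTENT: controller writes neither the ISR nor the replica list.
    state = ReplicaState.NON_EXISTENT

    # Controller then writes the new replica list, but not the ISR.
    zk["replicas"] = {1, 3}

    # No new data arrives, so the lag check finds nothing to remove.
    leader.shrink_isr()
    return zk, state


if __name__ == "__main__":
    zk, state = simulate_reassignment()
    print("replicas:", sorted(zk["replicas"]))
    print("isr:", sorted(zk["isr"]))
    print("stale ISR members:", sorted(zk["isr"] - zk["replicas"]))
```

In this sketch the old replica (broker 2) ends up in the ISR while being absent from the replica list, which is the inconsistency the report describes. Where the stray fetch lands relative to the later controller writes does not change the outcome, as long as it arrives after the OFFLINE step.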



--
This message was sent by Atlassian JIRA
(v6.1#6144)
