kafka-dev mailing list archives

From "Json Tu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-4447) Controller resigned but it also acts as a controller for a long time
Date Mon, 28 Nov 2016 16:24:58 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702383#comment-15702383 ]

Json Tu commented on KAFKA-4447:

[~skarface] Thanks for your reply.
In the latest release, handleNewSession() is implemented as follows:
  def handleNewSession() {
      info("ZK expired; shut down all controller components and try to re-elect")
      inLock(controllerContext.controllerLock) {

So deregisterIsrChangeNotificationListener() also runs while holding controllerLock, because
the lock is acquired outside onControllerResignation(). That locking is the fix for the bug
reported at https://issues.apache.org/jira/browse/KAFKA-4360.
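
To illustrate the point about the lock being taken outside onControllerResignation(), here is a minimal sketch. It is not the Kafka source: inLock is written here as a small helper modeled on the call shown above, and listenerRegistered and the stubbed onControllerResignation() are hypothetical stand-ins for the real de-registration work.

```scala
import java.util.concurrent.locks.{Lock, ReentrantLock}

object InLockSketch {
  // Run `body` while holding `lock`, and always release the lock afterwards.
  def inLock[T](lock: Lock)(body: => T): T = {
    lock.lock()
    try body
    finally lock.unlock()
  }

  val controllerLock = new ReentrantLock()

  // Hypothetical flag standing in for "the listener is still registered".
  var listenerRegistered = true

  // Stub: takes no lock itself, it relies on the caller to hold controllerLock.
  def onControllerResignation(): Unit = {
    listenerRegistered = false
  }

  // The lock is acquired here, outside onControllerResignation(),
  // so the de-registration inside it runs under the lock as well.
  def handleNewSession(): Unit = inLock(controllerLock) {
    onControllerResignation()
  }
}
```

Calling InLockSketch.handleNewSession() clears the flag under the lock and leaves the lock released once it returns.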

My version does not have that fix, so we can imagine the sequence as follows:
1. The ZK-expired callback fires and acquires controllerLock first, then starts executing
onControllerResignation().
2. At that moment, IsrChangeNotificationListener, PartitionsReassignedListener, and others
all fire in quick succession.
3. Then onControllerResignation() starts to de-register the listeners.

As we know, the zkclient callback thread is a single thread, so listeners fired after the ZK
session expired can only be executed after handleNewSession() returns.
Maybe this makes sense.
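
The single-thread ordering can be sketched as below. This is only an illustration under stated assumptions, not zkclient code: a single-thread executor stands in for the zkclient callback thread, and the object name, the isController flag, and the log messages are all hypothetical.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean
import scala.collection.mutable.ArrayBuffer

object SingleThreadCallbackSketch {
  // Returns the order in which the two queued callbacks actually ran.
  def simulate(): List[String] = {
    // One worker thread with a FIFO queue, standing in for the callback thread.
    val eventThread = Executors.newSingleThreadExecutor()
    val isController = new AtomicBoolean(true)
    val log = ArrayBuffer.empty[String]

    def record(msg: String): Unit = log.synchronized { log += msg; () }
    def enqueue(body: => Unit): Unit =
      eventThread.execute(new Runnable { def run(): Unit = body })

    // Queued first: the session-expired handler, which resigns.
    enqueue {
      record("handleNewSession: resigning")
      isController.set(false)
    }
    // Queued second: a listener callback fired after the expiry.
    enqueue {
      record(s"listener callback: isController=${isController.get}")
    }

    eventThread.shutdown()
    eventThread.awaitTermination(5, TimeUnit.SECONDS)
    log.synchronized { log.toList }
  }
}
```

Because the executor has one thread and processes tasks in submission order, the listener callback cannot start until the first task has returned, so it sees the state left behind by the resignation.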

> Controller resigned but it also acts as a controller for a long time 
> ---------------------------------------------------------------------
>                 Key: KAFKA-4447
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4447
>             Project: Kafka
>          Issue Type: Improvement
>          Components: controller
>    Affects Versions:,,,
>         Environment: Linux Os
>            Reporter: Json Tu
>         Attachments: log.tar.gz
> We have a cluster with 10 nodes, and we performed the following operations:
> 1. We reassigned some topic partitions from one node to the other 9 nodes in the cluster,
> which triggered the controller.
> 2. The controller invoked PartitionsReassignedListener's handleDataChange, read all partition
> reassignment rules from the ZK path, and executed onPartitionReassignment for every partition
> that matched the conditions.
> 3. But the controller's ZK session expired, after which some of the 9 nodes also expired
> from ZK.
> 5. Then the controller invoked onControllerResignation to resign as the controller.
> We found that after the controller resigned, it kept acting as controller for about 3 minutes,
> as can be seen in my attachment.

This message was sent by Atlassian JIRA
