kafka-dev mailing list archives

From "Guozhang Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-4447) Controller resigned but it also acts as a controller for a long time
Date Mon, 28 Nov 2016 17:50:58 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702635#comment-15702635 ]

Guozhang Wang commented on KAFKA-4447:

My two cents here:

1. ZkClient executes our listener functions in different threads, and hence we have seen
lots of race conditions in the controller. I believe [~onurkaraman] and [~becket_qin] have
been working on rewriting the controller to fix such multi-threading race conditions as a
whole, after which issues like this one should be fixed. Until that happens, I think a simple
check like the one Jason mentioned before may not be sufficient: the listener thread can
pass the check while the controller has not yet resigned, and the resignation can then
happen while the listener is still executing. I do not think this does any real harm, since
its requests to other brokers should be rejected because of the obsolete epoch number, but
the 3 minutes of log flooding could be irritating.
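
To illustrate why a stale controller's requests should be harmless, here is a minimal sketch of epoch-based rejection on the receiving broker. The class and method names (EpochFence, accept) are hypothetical and are not Kafka's actual classes; the sketch only assumes that each controller request carries the sender's controller epoch.

```java
// Hypothetical broker-side fence; illustrative only, not Kafka's real code.
final class EpochFence {
    // Highest controller epoch this broker has seen so far.
    private int highestEpochSeen = 0;

    // Returns true if the request is accepted, false if its epoch is stale.
    synchronized boolean accept(int controllerEpoch) {
        if (controllerEpoch < highestEpochSeen) {
            return false; // sent by a controller that has since been replaced
        }
        highestEpochSeen = controllerEpoch;
        return true;
    }
}
```

Under this sketch, once a broker has accepted a request from the new controller at epoch 6, anything the resigned controller still sends at epoch 5 is rejected, even if the old controller passed its own "am I still the controller?" check just before sending.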

2. The partition assignment can take a long time when there is a large number of partitions
to migrate; in this particular case the broker kept acting as the controller for 3 minutes
after it had resigned, because of the thread race. [~lindong] has already submitted some
optimizations to shorten this latency, but I think we also need to consider how to handle
such "long tasks" under a single-threaded model.
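
One possible way to handle a "long task" under a single-threaded model is to split it into batches that are re-enqueued as separate events, so that a resignation event can be processed between batches. The sketch below is only an illustration under that assumption; SingleThreadedController and all of its methods are hypothetical names, not the actual controller rewrite.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical single-threaded event loop; illustrative only.
final class SingleThreadedController {
    private final Queue<Runnable> events = new ArrayDeque<>();
    private final List<String> processed = new ArrayList<>();
    private boolean active = true;

    // Enqueue a reassignment that will be handled batchSize partitions at a time.
    void submitReassignment(List<String> partitions, int batchSize) {
        events.add(() -> reassignBatch(partitions, 0, batchSize));
    }

    // Enqueue a resignation; it is ordered with all other events.
    void submitResignation() {
        events.add(() -> active = false);
    }

    private void reassignBatch(List<String> partitions, int from, int batchSize) {
        if (!active) {
            return; // resigned in the meantime: drop the remaining work
        }
        int to = Math.min(from + batchSize, partitions.size());
        processed.addAll(partitions.subList(from, to));
        if (to < partitions.size()) {
            // Re-enqueue the remainder so that other events can run in between.
            events.add(() -> reassignBatch(partitions, to, batchSize));
        }
    }

    // Run all queued events on the calling thread, in order.
    void drain() {
        Runnable event;
        while ((event = events.poll()) != null) {
            event.run();
        }
    }

    List<String> processed() {
        return processed;
    }
}
```

Because every event runs on the same thread, the active check and the batch it guards cannot be interleaved with a resignation; the cost is that a resignation still waits for the batch currently in progress, so the batch size bounds the delay.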

> Controller resigned but it also acts as a controller for a long time 
> ---------------------------------------------------------------------
>                 Key: KAFKA-4447
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4447
>             Project: Kafka
>          Issue Type: Improvement
>          Components: controller
>    Affects Versions:,,,
>         Environment: Linux Os
>            Reporter: Json Tu
>         Attachments: log.tar.gz
> We have a cluster with 10 nodes, and we performed the following operations.
> 1. We reassigned some topic partitions from one node to the other 9 nodes in the cluster,
which triggered the controller.
> 2. The controller invoked PartitionsReassignedListener's handleDataChange, read all partition
reassignment rules from the zk path, and executed onPartitionReassignment for every partition
that matched the conditions.
> 3. But the controller expired from zk, after which some of the other 9 nodes also expired
from zk.
> 5. Then the controller invoked onControllerResignation to resign as the controller.
> We found that after the controller resigned, it kept acting as the controller for about
3 minutes, which can be seen in my attachment.

This message was sent by Atlassian JIRA
