kafka-dev mailing list archives

From "Guozhang Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-1134) onControllerFailover function should be synchronized with other functions
Date Thu, 05 Dec 2013 19:23:35 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13840449#comment-13840449 ]

Guozhang Wang commented on KAFKA-1134:
--------------------------------------

After checking the stack trace again, I now think the problem is the following:

1) In KafkaController.handleNewSession:

    controllerContext.controllerLock synchronized {
      Utils.unregisterMBean(KafkaController.MBeanName)
      partitionStateMachine.shutdown()
      replicaStateMachine.shutdown()
      if(controllerContext.controllerChannelManager != null) {
        controllerContext.controllerChannelManager.shutdown()
        controllerContext.controllerChannelManager = null
      }
      controllerElector.elect
    }

The elect function is called directly after controllerChannelManager.shutdown and is covered
by controllerContext.controllerLock. However, the logs show that elect is not called immediately,
because the AddPartitionsListener gets triggered by the ZK session expiration (a known issue
similar to KAFKA-1143), and that listener is covered by the same lock:

2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-587-send-thread], Stopped
2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-587-send-thread], Shutdown completed
2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-579-send-thread], Shutting down
2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-579-send-thread], Stopped
2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-579-send-thread], Shutdown completed
2013/11/14 00:00:24.603 [ReplicaStateMachine$BrokerChangeListener] [BrokerChangeListener on Controller 583]: Broker change listener fired for path /brokers/ids with children 583,575,585,587,579,589
2013/11/14 00:00:24.605 [ReplicaStateMachine$BrokerChangeListener] [BrokerChangeListener on Controller 583]: Broker change listener fired for path /brokers/ids with children 583,575,585,587,579,589
2013/11/14 00:00:24.614 [PartitionStateMachine$AddPartitionsListener] [AddPartitionsListener on 583]: Add Partition triggered { "partitions":{ "0":[ 577, 589 ], "1":[ 579, 575 ], "2":[ 581, 577 ], "3":[ 583, 579 ] }, "version":1 } for path /brokers/topics/databus2-relay-log_event
2013/11/14 00:00:24.616 [PartitionStateMachine$AddPartitionsListener] [AddPartitionsListener on 583]: New partitions to be added [Map()]
2013/11/14 00:00:24.616 [KafkaController] [Controller 583]: New partition creation callback for
2013/11/14 00:00:24.618 [PartitionStateMachine$AddPartitionsListener] [AddPartitionsListener on 583]: Add Partition triggered { "partitions":{ "0":[ 577, 589 ], "1":[ 579, 575 ], "2":[ 581, 577 ], "3":[ 583, 579 ] }, "version":1 } for path /brokers/topics/databus2-relay-log_event

----------------

Without more logging information I cannot deduce anything further, so I propose that in this JIRA
we just improve the logging, to make debugging easier if this issue comes up again.
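
To make the proposal concrete, here is one hypothetical shape such logging could take. It is
only a sketch and not what the attached patches do: LockLoggingSketch, inLockLogged and the
in-memory logLines buffer are names invented for illustration, and the ReentrantLock merely
stands in for controllerContext.controllerLock. The idea is to record which caller is waiting
for, holding, and releasing the lock, so that the ordering of handleNewSession, elect and the
listeners can be read off the log.

```scala
import java.util.concurrent.locks.ReentrantLock
import scala.collection.mutable.ListBuffer

object LockLoggingSketch {
  // Hypothetical stand-in for controllerContext.controllerLock.
  val controllerLock = new ReentrantLock()

  // Kept in memory so the sketch is self-contained; real code would use the broker's logger.
  val logLines = ListBuffer[String]()

  // Prefixes each message with the current thread name and appends it thread-safely.
  private def log(msg: String): Unit = logLines.synchronized {
    logLines += s"[${Thread.currentThread().getName}] $msg"
  }

  // Runs `body` under controllerLock and logs when `caller` waits for,
  // acquires, and releases the lock. The lock is released even if `body` throws.
  def inLockLogged[T](caller: String)(body: => T): T = {
    log(s"$caller waiting for controllerLock")
    controllerLock.lock()
    try {
      log(s"$caller acquired controllerLock")
      body
    } finally {
      log(s"$caller releasing controllerLock")
      controllerLock.unlock()
    }
  }
}
```

A caller such as handleNewSession would then wrap its body as
LockLoggingSketch.inLockLogged("handleNewSession") { ... }, and each listener would do the
same with its own name.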

> onControllerFailover function should be synchronized with other functions
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-1134
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1134
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.8.1
>            Reporter: Guozhang Wang
>         Attachments: KAFKA-1134.patch, KAFKA-1134_2013-12-05_11:13:33.patch
>
>
> Otherwise race conditions could happen. For example, handleNewSession will close all
> sockets with brokers while the handleStateChange in onControllerFailover tries to send requests
> to them.
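
The race described in the issue can be sketched as follows. This is a simplified illustration
under stated assumptions, not the Kafka code: ChannelManagerSketch, FailoverSyncSketch and
sendStateChange are hypothetical names, and the ReentrantLock stands in for
controllerContext.controllerLock. Both the shutdown path and the send path take the same lock,
and the send path re-checks the channel manager after acquiring it, so a request is never sent
on a channel that has already been closed.

```scala
import java.util.concurrent.locks.ReentrantLock
import scala.collection.mutable.ListBuffer

// Hypothetical stand-in for the controller's channel manager; not the Kafka class.
class ChannelManagerSketch {
  private var open = true
  val sent = ListBuffer[String]()

  // Fails loudly if a request is sent after the channels were closed.
  def send(request: String): Unit = {
    if (!open) throw new IllegalStateException("channel already closed")
    sent += request
  }

  def shutdown(): Unit = { open = false }
}

object FailoverSyncSketch {
  // Hypothetical stand-in for controllerContext.controllerLock.
  private val controllerLock = new ReentrantLock()

  // Set to null on shutdown, mirroring controllerChannelManager = null in handleNewSession.
  private var channelManager: ChannelManagerSketch = new ChannelManagerSketch

  // Runs `body` while holding controllerLock, releasing it even if `body` throws.
  private def inLock[T](body: => T): T = {
    controllerLock.lock()
    try body finally controllerLock.unlock()
  }

  // Shutdown path: closes the channels under the lock.
  def handleNewSession(): Unit = inLock {
    if (channelManager != null) {
      channelManager.shutdown()
      channelManager = null
    }
  }

  // Send path: takes the same lock and re-checks the manager before sending.
  // Returns true if the request was sent, false if the channels were already closed.
  def sendStateChange(request: String): Boolean = inLock {
    if (channelManager == null) false
    else {
      channelManager.send(request)
      true
    }
  }
}
```

Without the shared lock, a send could pass its null check just before handleNewSession closes
the channel and then fail on the closed socket; with it, the two paths are serialized.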



--
This message was sent by Atlassian JIRA
(v6.1#6144)
