kafka-dev mailing list archives

From "Jun Rao (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (KAFKA-1029) Zookeeper leader election stuck in ephemeral node retry loop
Date Wed, 28 Aug 2013 15:34:52 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Rao reassigned KAFKA-1029:
------------------------------

    Assignee: Sam Meder  (was: Neha Narkhede)
    
> Zookeeper leader election stuck in ephemeral node retry loop
> ------------------------------------------------------------
>
>                 Key: KAFKA-1029
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1029
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8
>            Reporter: Sam Meder
>            Assignee: Sam Meder
>            Priority: Blocker
>             Fix For: 0.8
>
>         Attachments: 0002-KAFKA-1029-Use-brokerId-instead-of-leaderId-when-tri.patch
>
>
> We're seeing the following log statements (over and over):
> [2013-08-27 07:21:49,538] INFO conflict in /controller data: { "brokerid":3, "timestamp":"1377587945206", "version":1 } stored data: { "brokerid":2, "timestamp":"1377587460904", "version":1 } (kafka.utils.ZkUtils$)
> [2013-08-27 07:21:49,559] INFO I wrote this conflicted ephemeral node [{ "brokerid":3, "timestamp":"1377587945206", "version":1 }] at /controller a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
> where the broker is essentially stuck in the loop that is trying to deal with left-over ephemeral nodes. The code looks a bit racy to me. In particular:
> ZookeeperLeaderElector:
>   def elect: Boolean = {
>     controllerContext.zkClient.subscribeDataChanges(electionPath, leaderChangeListener)
>     val timestamp = SystemTime.milliseconds.toString
>     val electString = ...
>     try {
>       createEphemeralPathExpectConflictHandleZKBug(controllerContext.zkClient, electionPath, electString, leaderId,
>         (controllerString : String, leaderId : Any) => KafkaController.parseControllerId(controllerString) == leaderId.asInstanceOf[Int],
>         controllerContext.zkSessionTimeout)
> leaderChangeListener is registered before the create call (by the way, it looks like a new registration will be added on every elect call - shouldn't it register in startup()?), so it can update leaderId to the current leader before the call to create. If that happens, we will continuously get node exists exceptions and the checker function will always return true, i.e. we will never get out of the while(true) loop.
> I think the right fix here is to pass brokerId instead of leaderId when calling create, i.e.
> createEphemeralPathExpectConflictHandleZKBug(controllerContext.zkClient, electionPath, electString, brokerId,
>         (controllerString : String, leaderId : Any) => KafkaController.parseControllerId(controllerString) == leaderId.asInstanceOf[Int],
>         controllerContext.zkSessionTimeout)
> With this change, the loop dealing with the ephemeral node bug is only triggered for the broker that owned the node previously, although I am still not 100% sure if that is sufficient.
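
The race described in the report can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `ElectionRaceSketch`, `ownsNode`, `wouldSpin`, and `maxRetries` are hypothetical names invented for illustration, not Kafka or ZkUtils APIs, and the real loop is unbounded whereas this one is capped so that it terminates. It assumes the situation from the log lines: broker 3 runs the election while /controller already holds a node written by broker 2.

```scala
// Toy model of the retry loop from the report. All names are hypothetical;
// this is not Kafka's implementation.
object ElectionRaceSketch {
  // Same comparison as the checker lambda in the report: does the stored
  // node belong to the id that was passed to the create call?
  def ownsNode(storedBrokerId: Int, expectedOwner: Int): Boolean =
    storedBrokerId == expectedOwner

  // Each iteration models one create attempt that fails with "node exists".
  // While the checker returns true, the caller assumes the node is its own
  // leftover from an old session, backs off, and retries. Returns true if
  // every attempt up to maxRetries retried, i.e. the real unbounded loop
  // would never exit while the node remains.
  def wouldSpin(storedBrokerId: Int, expectedOwner: Int, maxRetries: Int = 5): Boolean =
    (1 to maxRetries).forall(_ => ownsNode(storedBrokerId, expectedOwner))

  def main(args: Array[String]): Unit = {
    val brokerId = 3 // the broker running elect
    val stored   = 2 // id in the existing /controller node
    val leaderId = 2 // already overwritten by the listener before create

    println(s"passing leaderId: spins = ${wouldSpin(stored, leaderId)}")
    println(s"passing brokerId: spins = ${wouldSpin(stored, brokerId)}")
  }
}
```

In this model, passing leaderId makes the checker compare the stored id with itself once the listener has fired, so the loop keeps retrying for as long as the live controller's node exists. Passing brokerId makes the comparison fail immediately when another broker owns the node, which matches the intent of the proposed fix.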

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
