kafka-dev mailing list archives

From "Kenny (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
Date Thu, 17 Jul 2014 14:49:04 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064976#comment-14064976 ]

Kenny commented on KAFKA-1451:
------------------------------

This can also be caused by restarting Kafka quickly after a SIGKILL. I had a supervisord config
file with 'stopwaitsecs=1', and it would pretty reliably create a hung Kafka process.

> Broker stuck due to leader election race 
> -----------------------------------------
>
>                 Key: KAFKA-1451
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1451
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.1.1
>            Reporter: Maciek Makowski
>            Priority: Minor
>
> h3. Symptoms
> The broker does not become available because it is stuck in an infinite loop while electing
> a leader. This can be recognised by the following line being repeatedly written to server.log:
> {code}
> [2014-05-14 04:35:09,187] INFO I wrote this conflicted ephemeral node [{"version":1,"brokerid":1,"timestamp":"1400060079108"}] at /controller a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
> {code}
> h3. Steps to Reproduce
> In a setup with a single Kafka 0.8.1.1 node and a single ZooKeeper 3.4.6 node (this will
> likely behave the same with the ZK version included in the Kafka distribution):
> # start both zookeeper and kafka (in any order)
> # stop zookeeper
> # stop kafka
> # start kafka
> # start zookeeper
> h3. Likely Cause
> {{ZookeeperLeaderElector}} subscribes to data changes on startup, and then triggers an
> election. If the deletion of the ephemeral {{/controller}} node associated with the broker's
> previous zookeeper session happens after the subscription to changes in the new session, the
> election will be invoked twice, once from {{startup}} and once from {{handleDataDeleted}}:
> * {{startup}}: acquire {{controllerLock}}
> * {{startup}}: subscribe to data changes
> * zookeeper: delete {{/controller}} since the session that created it timed out
> * {{handleDataDeleted}}: {{/controller}} was deleted
> * {{handleDataDeleted}}: wait on {{controllerLock}}
> * {{startup}}: elect -- writes {{/controller}}
> * {{startup}}: release {{controllerLock}}
> * {{handleDataDeleted}}: acquire {{controllerLock}}
> * {{handleDataDeleted}}: elect -- attempts to write {{/controller}} and then gets into
> an infinite loop as a result of the conflict
> {{createEphemeralPathExpectConflictHandleZKBug}} assumes that the existing znode was
> written from a different session, which is not true in this case; it was written from the
> same session. That adds to the confusion.
> h3. Suggested Fix
> In {{ZookeeperLeaderElector.startup}} first run {{elect}} and then subscribe to data
> changes.
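
The interleaving under "Likely Cause" and the reordering under "Suggested Fix" can be sketched as a small deterministic simulation. This is a toy model, not Kafka's code: `FakeZk`, `try_create`, `expire_stale_node`, `elect` and `startup` are hypothetical names, `controllerLock` is modelled by queueing watcher callbacks until startup has finished, and the infinite retry is represented by a single log entry.

```python
class FakeZk:
    """Hypothetical stand-in for ZooKeeper: one ephemeral /controller znode."""

    def __init__(self, stale_session):
        self.owner = stale_session  # session that owns /controller, or None
        self.watchers = []          # deletion listeners registered so far
        self.pending = []           # callbacks waiting for controllerLock

    def subscribe(self, callback):
        self.watchers.append(callback)

    def expire_stale_node(self):
        # ZooKeeper removes /controller because the old session timed out.
        self.owner = None
        self.pending.extend(self.watchers)

    def try_create(self, session):
        # Returns True if /controller was created, False on conflict.
        if self.owner is None:
            self.owner = session
            return True
        return False


def elect(zk, session, caller, log):
    if zk.try_create(session):
        log.append(f"{caller}: wrote /controller")
    elif zk.owner == session:
        # The conflict handler assumes another session wrote the node and
        # waits for a deletion that never comes (the reported loop).
        log.append(f"{caller}: conflict with own node, stuck retrying")
    else:
        # Genuine stale node: back off until ZooKeeper deletes it, then retry.
        zk.expire_stale_node()
        zk.try_create(session)
        log.append(f"{caller}: wrote /controller after backoff")


def startup(subscribe_first):
    zk = FakeZk(stale_session="old")
    log = []

    def on_deleted():
        elect(zk, "new", "handleDataDeleted", log)

    if subscribe_first:
        # Order described in the report: subscribe, then elect.
        zk.subscribe(on_deleted)
        zk.expire_stale_node()  # deletion arrives while startup holds the lock
        elect(zk, "new", "startup", log)
    else:
        # Suggested order: elect, then subscribe.
        elect(zk, "new", "startup", log)
        zk.subscribe(on_deleted)

    # startup releases controllerLock; queued callbacks run now.
    for callback in list(zk.pending):
        callback()
    return log


if __name__ == "__main__":
    for first in (True, False):
        print("subscribe first" if first else "elect first", startup(first))
```

In this model the subscribe-first order ends with a second election colliding with the node the same session just wrote, while the elect-first order performs a single election, because the stale node has to be gone before `elect` can succeed and so its deletion always precedes the subscription.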



--
This message was sent by Atlassian JIRA
(v6.2#6252)
