kafka-dev mailing list archives

From "Jun Rao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
Date Thu, 13 Apr 2017 15:09:41 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967705#comment-15967705 ]

Jun Rao commented on KAFKA-2729:

Thanks for the additional info. In both [~Ronghua Lin]'s and [~allenzhuyi]'s cases, it seems
a ZK session expiration had occurred. As I mentioned earlier in the jira, there is a known issue,
reported in KAFKA-3083: when the controller's ZK session expires and it loses its controllership,
it's possible for this zombie controller to continue updating ZK and/or sending LeaderAndIsrRequests
to the brokers for a short period of time. When this happens, a broker may not have the
most up-to-date information about the leader and isr, which can lead to subsequent ZK update failures
when the isr needs to be changed.
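The "Cached zkVersion not equal to that in zookeeper" log line is essentially a failed conditional write: the broker caches the znode version from its last read, and ZK rejects any update whose expected version is stale. A minimal sketch of that compare-and-set semantics (plain Python toy model, not the actual Kafka or ZooKeeper client code):

```python
class BadVersionError(Exception):
    """Raised when the caller's cached version is stale."""


class Znode:
    """Toy model of a ZooKeeper znode's conditional-update rule."""

    def __init__(self, data):
        self.data = data
        self.version = 0  # bumped on every successful write

    def set(self, data, expected_version):
        # ZK applies the write only if the caller's version matches.
        if expected_version != self.version:
            raise BadVersionError(
                "Cached zkVersion [%d] not equal to that in zookeeper [%d]"
                % (expected_version, self.version))
        self.data = data
        self.version += 1
        return self.version


# A zombie controller writes the ISR znode directly...
isr = Znode("6,5")
isr.set("5", expected_version=0)  # znode version is now 1

# ...so a broker still holding cached version 0 has its ISR shrink skipped.
try:
    isr.set("5", expected_version=0)
except BadVersionError as e:
    print(e)
```

The broker then keeps retrying with the stale cached version, which is why the log line repeats until a restart forces a fresh read from ZK.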

It may take some time to have this issue fixed. In the interim, the workaround for this issue
is to make sure ZK session expiration never happens. The first step is to figure out what's
causing the ZK session to expire. Two common causes are (1) long broker GC pauses and (2) network
glitches. For (1), one needs to tune the GC in the broker properly. For (2), one can look
at the reported time during which the ZK client couldn't hear from the ZK server and increase the ZK
session timeout accordingly.
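As an illustration of that workaround, a sketch of the two knobs involved; the values below are placeholders for illustration, not recommendations from this thread:

```shell
# Illustrative only -- tune for your own environment.

# (1) Reduce long GC pauses: kafka-run-class.sh picks up JVM flags from
#     this environment variable, e.g. G1 with a modest pause target.
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20"

# (2) Tolerate short network glitches: raise the broker's ZK session
#     timeout in server.properties so brief outages don't expire the
#     session, e.g.:
#
#       zookeeper.session.timeout.ms=18000
```

Raising the session timeout trades slower failure detection for fewer spurious expirations, so it should only be increased as far as the observed glitch duration warrants.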

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions:
>            Reporter: Danil Serdyuchenko
> After a small network wobble where zookeeper nodes couldn't reach each other, we started
seeing a large number of undereplicated partitions. The zookeeper cluster recovered, however
we continued to see a large number of undereplicated partitions. Two brokers in the kafka
cluster were showing this in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker
5: Shrinking ISR for partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker
5: Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the affected brokers. Both brokers only recovered after a restart.
Our own investigation yielded nothing; I was hoping you could shed some light on this issue.
Possibly it's related to: https://issues.apache.org/jira/browse/KAFKA-1382 , however we're

This message was sent by Atlassian JIRA
