kafka-jira mailing list archives

From "Jun Rao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
Date Fri, 13 Oct 2017 23:53:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16204367#comment-16204367 ]

Jun Rao commented on KAFKA-2729:
--------------------------------

[~fravigotti], sorry to hear that. A couple of quick suggestions.

(1) Do you see any ZK session expiration in the log (e.g., INFO zookeeper state changed (Expired)
(org.I0Itec.zkclient.ZkClient))? There are known bugs in how Kafka handles ZK session expiration,
so it is best to avoid expiration in the first place. Typical causes of ZK session expiration
are long GC pauses in the broker or network glitches, so you can either tune the broker or increase
zookeeper.session.timeout.ms.
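
As a rough way to check for the expirations mentioned in (1), the sketch below scans broker log lines for the "zookeeper state changed (Expired)" message. The bracketed timestamp prefix is an assumption about the broker's log layout, and the function name is hypothetical; adjust the pattern if your log4j format differs.

```python
import re
from datetime import datetime

# Matches the session-expiration message quoted in (1). The leading
# "[yyyy-MM-dd HH:mm:ss,SSS]" prefix is an assumed log layout.
EXPIRED = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\] "
    r"INFO zookeeper state changed \(Expired\)"
)

def find_session_expirations(lines):
    """Return the timestamps of ZK session expirations found in the given log lines."""
    hits = []
    for line in lines:
        m = EXPIRED.match(line)
        if m:
            hits.append(datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S"))
    return hits
```

It could be run over a broker log with something like `find_session_expirations(open("server.log"))`, where `server.log` is a placeholder path. Expirations that line up with long GC pauses would point to tuning the broker, as suggested above.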

(2) Do you have lots of partitions (say, a few thousand) per broker? If so, you want to check
whether the controlled shutdown succeeds when shutting down a broker. If it does not, restarting
the broker too soon could also lead the cluster to a weird state. To address this issue, you can
increase request.timeout.ms on the broker.
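
To get a rough idea of whether (2) applies, the sketch below counts partition replicas per broker from the text output of the topic describe tool. It assumes each partition line contains a field shaped like "Replicas: 1,2,3"; both that format and the function name are assumptions for illustration.

```python
import re
from collections import Counter

# Assumed shape of a partition line in the describe output:
#   "Topic: t  Partition: 0  Leader: 1  Replicas: 1,2  Isr: 1,2"
REPLICAS = re.compile(r"Replicas:\s*([\d,]+)")

def replicas_per_broker(describe_lines):
    """Count how many partition replicas each broker id hosts."""
    counts = Counter()
    for line in describe_lines:
        m = REPLICAS.search(line)
        if not m:
            continue  # summary lines and blanks carry no replica list
        for broker_id in m.group(1).split(","):
            if broker_id:
                counts[int(broker_id)] += 1
    return counts
```

Brokers whose count reaches a few thousand are the ones where it is worth verifying that controlled shutdown completes before restarting.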

We are fixing the known issue in (1) and improving the performance with lots of partitions
in (2) in KAFKA-5642 and we expect the fix to be included in the 1.1.0 release in Feb.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, we started
> seeing a large number of under-replicated partitions. The zookeeper cluster recovered; however,
> we continued to see a large number of under-replicated partitions. Two brokers in the kafka
> cluster were showing this in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This happened for all of the topics on the affected brokers. Both brokers only recovered after a restart.
> Our own investigation yielded nothing; I was hoping you could shed some light on this issue.
> Possibly it's related to https://issues.apache.org/jira/browse/KAFKA-1382 ; however, we're
> using 0.8.2.1.
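
To gauge how widespread the symptom in the quoted report is, the sketch below groups "Cached zkVersion ... not equal to that in zookeeper" messages by broker. The pattern is modeled only on the two log lines quoted above and assumes each message sits on a single line in the real broker log; the function name is hypothetical.

```python
import re
from collections import defaultdict

# Modeled on the quoted message:
#   "Partition [topic,3] on broker 5: Cached zkVersion [66] not equal to that in zookeeper, ..."
STALE = re.compile(
    r"Partition \[(?P<topic>.+),(?P<partition>\d+)\] on broker (?P<broker>\d+): "
    r"Cached zkVersion \[(?P<version>\d+)\] not equal to that in zookeeper"
)

def stale_partitions_by_broker(lines):
    """Map broker id -> set of (topic, partition) that logged a stale cached zkVersion."""
    stale = defaultdict(set)
    for line in lines:
        m = STALE.search(line)
        if m:
            stale[int(m.group("broker"))].add((m.group("topic"), int(m.group("partition"))))
    return dict(stale)
```

A broker that keeps logging this message for many partitions matches the state described in the report.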



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)