kafka-jira mailing list archives

From "Andrey Elenskiy (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
Date Thu, 22 Jun 2017 21:36:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060049#comment-16060049 ]

Andrey Elenskiy edited comment on KAFKA-2729 at 6/22/17 9:35 PM:
-----------------------------------------------------------------

Seeing the same issue on 0.10.2. 

A node running ZooKeeper lost networking for a split second and initiated an election, which
caused some sessions to expire with:

{{2017-06-22 02:07:36,092 [myid:3] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@373]
- Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running}}

which caused controller resignation:

{{[2017-06-22 02:07:36,363] INFO [SessionExpirationListener on 158980], ZK expired; shut down
all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
[2017-06-22 02:07:37,028] DEBUG [Controller 158980]: Controller resigning, broker id 158980
(kafka.controller.KafkaController)
[2017-06-22 02:07:37,028] DEBUG [Controller 158980]: De-registering IsrChangeNotificationListener
(kafka.controller.KafkaController)
[2017-06-22 02:07:37,028] INFO [Partition state machine on Controller 158980]: Stopped partition
state machine (kafka.controller.PartitionStateMachine)
[2017-06-22 02:07:37,028] INFO [Replica state machine on controller 158980]: Stopped replica
state machine (kafka.controller.ReplicaStateMachine)
[2017-06-22 02:07:37,028] INFO [Controller 158980]: Broker 158980 resigned as the controller
(kafka.controller.KafkaController)}}

After that, the broker kept logging the following in its server log for the next 8 hours,
until it was restarted manually:

{{[2017-06-22 17:41:06,928] INFO Partition [A,5] on broker 158980: Shrinking ISR for partition
[A,5] from 158980,133641,155394 to 158980 (kafka.cluster.Partition)
[2017-06-22 17:41:06,935] INFO Partition [A,5] on broker 158980: Cached zkVersion [73] not
equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)}}
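
For anyone unfamiliar with the message above, here is a minimal sketch of the general pattern behind it: a versioned conditional write that keeps failing because the writer's cached version is stale. This is an illustration only, not Kafka's or ZooKeeper's code; `VersionedNode`, `Leader`, and `try_shrink_isr` are hypothetical names made up for the example.

```python
class BadVersionError(Exception):
    """Raised when the expected version does not match the stored one."""


class VersionedNode:
    """Toy stand-in for a znode: data plus a version bumped on every write."""

    def __init__(self, data):
        self.data = data
        self.version = 0

    def set_data(self, data, expected_version):
        # Conditional write: succeeds only if the caller's version is current.
        if expected_version != self.version:
            raise BadVersionError(
                f"expected {expected_version}, found {self.version}")
        self.data = data
        self.version += 1
        return self.version


class Leader:
    """Hypothetical partition leader that caches the node version."""

    def __init__(self, node):
        self.node = node
        self.cached_version = node.version

    def try_shrink_isr(self, new_isr):
        try:
            self.cached_version = self.node.set_data(new_isr, self.cached_version)
            return True
        except BadVersionError:
            # Mirrors "skip updating ISR": the cache is not refreshed here,
            # so every later attempt fails the same way.
            return False


node = VersionedNode([158980, 133641, 155394])
leader = Leader(node)

# Another writer (for example, a newly elected controller) updates the node first.
node.set_data([158980, 133641], expected_version=0)

# The leader now retries with its stale cached version and never succeeds.
attempts = [leader.try_shrink_isr([158980]) for _ in range(3)]
```

In this toy model the leader only recovers if something outside the retry loop hands it the current version, which matches the symptom of the message repeating until the broker was restarted.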




> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where ZooKeeper nodes couldn't reach each other, we started
> seeing a large number of under-replicated partitions. The ZooKeeper cluster recovered, but
> we continued to see a large number of under-replicated partitions. Two brokers in the Kafka
> cluster were showing this in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This happened for all of the topics on the affected brokers. Both brokers only recovered
> after a restart. Our own investigation yielded nothing; I was hoping you could shed some
> light on this issue. It is possibly related to https://issues.apache.org/jira/browse/KAFKA-1382 ;
> however, we're using 0.8.2.1.
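
To check which partitions on a broker are stuck in this state, the server log can be scanned for the message quoted above. The following is a rough sketch that assumes each log entry is on a single line in the format shown in this issue; `count_stale_partitions` and the regular expression are made up for this example and are not part of any Kafka tooling.

```python
import re
from collections import Counter

# Matches the "Cached zkVersion ... not equal" entries in the format quoted above.
PATTERN = re.compile(
    r"Partition \[(?P<topic>[^,\]]+),(?P<partition>\d+)\] on broker (?P<broker>\d+): "
    r"Cached zkVersion \[(?P<version>\d+)\] not equal to that in zookeeper"
)


def count_stale_partitions(lines):
    """Count matching log entries per (broker id, topic, partition)."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            key = (int(m["broker"]), m["topic"], int(m["partition"]))
            counts[key] += 1
    return counts


sample = [
    "[2015-10-27 11:36:00,888] INFO Partition [__samza_checkpoint_event-creation_1,3] "
    "on broker 5: Shrinking ISR for partition [__samza_checkpoint_event-creation_1,3] "
    "from 6,5 to 5 (kafka.cluster.Partition)",
    "[2015-10-27 11:36:00,891] INFO Partition [__samza_checkpoint_event-creation_1,3] "
    "on broker 5: Cached zkVersion [66] not equal to that in zookeeper, "
    "skip updating ISR (kafka.cluster.Partition)",
]

stale = count_stale_partitions(sample)
for (broker, topic, partition), n in stale.items():
    print(f"broker {broker}: {topic}-{partition} skipped {n} ISR update(s)")
```

In practice `lines` would be an open file handle on the broker's server log rather than the two sample entries used here.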



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
