kafka-jira mailing list archives

From "Francesco vigotti (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
Date Fri, 13 Oct 2017 08:48:01 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203198#comment-16203198
] 

Francesco vigotti commented on KAFKA-2729:
------------------------------------------

I'm having the same issue and am definitely losing trust in Kafka: every two months something
forces me to reset the whole cluster. I have been searching for a good alternative for a distributed,
persistent, fast queue for a while, but have yet to find something that gives me a good vibe.

Anyway, I'm facing this same issue with a small difference:
- restarting all brokers (both all together and as a rolling restart) didn't fix it.

All brokers in the cluster log errors like these:
--- broker 5

{code:java}

[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,17]
to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,23]
to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,47]
to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,29]
to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

{code}

--- broker 3

{code:java}

[2017-10-13 08:13:58,547] INFO Partition [__consumer_offsets,20] on broker 3: Expanding ISR
for partition __consumer_offsets-20 from 3,2 to 3,2,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,551] INFO Partition [__consumer_offsets,44] on broker 3: Expanding ISR
for partition __consumer_offsets-44 from 3,2 to 3,2,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,554] INFO Partition [__consumer_offsets,5] on broker 3: Expanding ISR
for partition __consumer_offsets-5 from 2,3 to 2,3,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,557] INFO Partition [__consumer_offsets,26] on broker 3: Expanding ISR
for partition __consumer_offsets-26 from 3,2 to 3,2,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,563] INFO Partition [__consumer_offsets,29] on broker 3: Expanding ISR
for partition __consumer_offsets-29 from 2,3 to 2,3,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,566] INFO Partition [__consumer_offsets,32] on broker 3: Expanding ISR
for partition __consumer_offsets-32 from 3,2 to 3,2,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,570] INFO Partition [legacyJavaVarT,2] on broker 3: Expanding ISR for
partition legacyJavaVarT-2 from 3 to 3,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,573] INFO Partition [test4,3] on broker 3: Expanding ISR for partition
test4-3 from 2,3 to 2,3,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,577] INFO Partition [test4,0] on broker 3: Expanding ISR for partition
test4-0 from 3,2 to 3,2,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,582] INFO Partition [test3,5] on broker 3: Expanding ISR for partition
test3-5 from 3 to 3,5 (kafka.cluster.Partition)

{code}


--- broker 2

{code:java}

[2017-10-13 08:13:36,289] INFO Partition [__consumer_offsets,11] on broker 2: Expanding ISR
for partition __consumer_offsets-11 from 2,5 to 2,5,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,293] INFO Partition [__consumer_offsets,41] on broker 2: Expanding ISR
for partition __consumer_offsets-41 from 2,5 to 2,5,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,296] INFO Partition [test3,2] on broker 2: Expanding ISR for partition
test3-2 from 2 to 2,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,300] INFO Partition [__consumer_offsets,23] on broker 2: Expanding ISR
for partition __consumer_offsets-23 from 2,5 to 2,5,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,304] INFO Partition [__consumer_offsets,5] on broker 2: Expanding ISR
for partition __consumer_offsets-5 from 2,5 to 2,5,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,337] INFO Partition [__consumer_offsets,35] on broker 2: Expanding ISR
for partition __consumer_offsets-35 from 2,5 to 2,5,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,372] INFO Partition [test_mainlog,24] on broker 2: Expanding ISR for
partition test_mainlog-24 from 2 to 2,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,375] INFO Partition [test_mainlog,6] on broker 2: Expanding ISR for partition
test_mainlog-6 from 2 to 2,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,379] INFO Partition [test_mainlog,18] on broker 2: Expanding ISR for
partition test_mainlog-18 from 2 to 2,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,384] INFO Partition [test_mainlog,0] on broker 2: Expanding ISR for partition
test_mainlog-0 from 2 to 2,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,388] INFO Partition [test_mainlog,12] on broker 2: Expanding ISR for
partition test_mainlog-12 from 2 to 2,3 (kafka.cluster.Partition)
[2017-10-13 08:13:40,367] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions
__consumer_offsets-47 (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,367] INFO Truncating log __consumer_offsets-47 to offset 0. (kafka.log.Log)
[2017-10-13 08:13:40,374] INFO [ReplicaFetcherThread-0-3], Starting  (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:40,374] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions
List([__consumer_offsets-47, initOffset 0 to broker BrokerEndPoint(3,--hidden----.73,9092)]
) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,376] ERROR [ReplicaFetcherThread-0-3], Error for partition [__consumer_offsets,47]
to broker 3:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:40,393] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions
__consumer_offsets-29 (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,393] INFO Truncating log __consumer_offsets-29 to offset 0. (kafka.log.Log)
[2017-10-13 08:13:40,402] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions
List([__consumer_offsets-29, initOffset 0 to broker BrokerEndPoint(3,--hidden----.73,9092)]
) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,403] ERROR [ReplicaFetcherThread-0-3], Error for partition [__consumer_offsets,29]
to broker 3:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:40,407] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions
__consumer_offsets-41 (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,407] INFO Truncating log __consumer_offsets-41 to offset 0. (kafka.log.Log)
[2017-10-13 08:13:40,413] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions
List([__consumer_offsets-41, initOffset 0 to broker BrokerEndPoint(3,--hidden----.73,9092)]
) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,414] ERROR [ReplicaFetcherThread-0-3], Error for partition [__consumer_offsets,41]
to broker 3:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:40,419] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions
test_mainlog-6 (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,419] INFO Truncating log test_mainlog-6 to offset 4997933406. (kafka.log.Log)
[2017-10-13 08:13:40,425] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions
List([test_mainlog-6, initOffset 4997933406 to broker BrokerEndPoint(3,--hidden----.73,9092)]
) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,432] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions
__consumer_offsets-17 (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,432] INFO Truncating log __consumer_offsets-17 to offset 0. (kafka.log.Log)
[2017-10-13 08:13:40,438] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions
List([__consumer_offsets-17, initOffset 0 to broker BrokerEndPoint(3,--hidden----.73,9092)]
) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,443] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions
test_mainlog-0 (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,443] INFO Truncating log test_mainlog-0 to offset 5704085814. (kafka.log.Log)
[2017-10-13 08:13:40,449] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions
List([test_mainlog-0, initOffset 5704085814 to broker BrokerEndPoint(3,--hidden----.73,9092)]
) (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,464] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions
__consumer_offsets-14 (kafka.server.ReplicaFetcherManager)
[2017-10-13 08:13:40,464] INFO Truncating log __consumer_offsets-14 to offset 0. (kafka.log.Log)
[2017-10-13 08:13:40,472] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions
List([__consumer_offsets-14, initOffset 0 to broker BrokerEndPoint(3,--hidden----.73,9092)]
) (kafka.server.ReplicaFetcherManager)

{code}
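
To see at a glance which partitions keep failing and against which broker, excerpts like the ones above can be tallied with a short script. This is only a sketch: the regular expression is written against the log format shown in this message, the function name `tally_fetch_errors` is made up for illustration, and the sample text is copied from the excerpts above.

```python
import re
from collections import Counter

# Matches the ReplicaFetcherThread error lines as they appear in this message.
FETCH_ERROR = re.compile(
    r"Error for partition \[(?P<topic>[^,\]]+),(?P<partition>\d+)\] "
    r"to broker (?P<broker>\d+):(?P<exception>[\w.]+)"
)


def tally_fetch_errors(log_text):
    """Count fetch errors per (topic, partition, target broker, exception)."""
    # The archive hard-wraps long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    counts = Counter()
    for match in FETCH_ERROR.finditer(flat):
        short_name = match.group("exception").rsplit(".", 1)[-1]
        key = (
            match.group("topic"),
            int(match.group("partition")),
            int(match.group("broker")),
            short_name,
        )
        counts[key] += 1
    return counts


SAMPLE = """
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,17]
to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,23]
to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:40,376] ERROR [ReplicaFetcherThread-0-3], Error for partition [__consumer_offsets,47]
to broker 3:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
"""

if __name__ == "__main__":
    for key, count in sorted(tally_fetch_errors(SAMPLE).items()):
        print(key, count)
```

Grouping by target broker makes it easier to compare what each follower believes about leadership with what the supposed leader reports.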







These logs go on for hours and the cluster never recovers. The only thing that changes anything
is when I repeatedly run, from ZooKeeper:
delete /controller # repeatedly, until the controller gets assigned to the kafka3 node
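
As a rough sketch of that workaround, the loop below keeps deleting the controller znode until a chosen broker wins the election. Several things here are assumptions and not taken from this message: that the `/controller` payload is JSON with a `brokerid` field, that the kafka3 node has broker id 3, and the names `force_controller` and `FakeZk`. The ZooKeeper access is passed in as two callables so that no particular client library is implied; `FakeZk` only simulates a sequence of election winners.

```python
import json


def controller_broker_id(payload):
    """Read the broker id from the bytes stored in /controller (assumed JSON)."""
    return int(json.loads(payload.decode("utf-8"))["brokerid"])


def force_controller(read_controller, delete_controller, target_broker,
                     max_deletions=10):
    """Delete /controller until target_broker is the controller.

    read_controller() must return the znode payload as bytes.
    delete_controller() must delete the znode and return once a new
    controller has been elected. Both are hypothetical hooks supplied
    by the caller. Returns the number of deletions that were needed.
    """
    deletions = 0
    while True:
        if controller_broker_id(read_controller()) == target_broker:
            return deletions
        if deletions >= max_deletions:
            raise RuntimeError(
                "broker %d did not become controller after %d deletions"
                % (target_broker, deletions)
            )
        delete_controller()
        deletions += 1


class FakeZk:
    """In-memory stand-in that replays a fixed list of election winners."""

    def __init__(self, winners):
        self._winners = list(winners)

    def read(self):
        record = {"version": 1, "brokerid": self._winners[0], "timestamp": "0"}
        return json.dumps(record).encode("utf-8")

    def delete(self):
        # Dropping the current winner simulates a re-election.
        self._winners.pop(0)


if __name__ == "__main__":
    zk = FakeZk([5, 2, 3])
    print("deletions needed:", force_controller(zk.read, zk.delete, 3))
```

The `max_deletions` cap is there because the election winner is not under the caller's control, so an unbounded loop could keep churning the controller indefinitely.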

At that point all errors stop (no more error logs) and Kafka seems to work: kafkamanager
shows offsets for all partitions (whereas some offsets were missing before), and data ingestion
and consumption work. The only thing that hints at something wrong is that on one topic with
30 partitions and replication 2 there is a broker skew (one broker has one partition more than
normal and one broker has one partition fewer than normal).
The situation remains stable with this small anomaly for hours: nodes delete indexes,
delete segments, roll new segments...
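
For reference, the skew on that topic can be expressed as each broker's replica count minus the even share. The sketch below assumes the cluster consists of exactly the three broker ids seen in the logs (2, 3 and 5), which this message does not state, and it builds a made-up assignment; neither the assignment nor the function name `replica_skew` comes from the real cluster.

```python
from collections import Counter


def replica_skew(assignment, brokers):
    """Return {broker: replica count minus the even share} for one topic.

    assignment maps partition number -> list of broker ids holding a replica.
    """
    counts = Counter({broker: 0 for broker in brokers})
    for replicas in assignment.values():
        counts.update(replicas)
    even_share = sum(counts.values()) / len(brokers)
    return {broker: counts[broker] - even_share for broker in brokers}


BROKERS = [2, 3, 5]  # assumed: the broker ids that appear in the logs

# Hypothetical balanced layout: 30 partitions, replication factor 2.
balanced = {
    p: [BROKERS[p % 3], BROKERS[(p + 1) % 3]] for p in range(30)
}

# Move one replica of partition 1 from broker 5 to broker 2 to mimic
# the anomaly described above.
skewed = dict(balanced)
skewed[1] = [3, 2]

if __name__ == "__main__":
    print("balanced:", replica_skew(balanced, BROKERS))
    print("skewed:  ", replica_skew(skewed, BROKERS))
```

With 60 replicas over three brokers the even share is 20 each, so a single misplaced replica shows up as one broker at +1 and another at -1.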

If I now delete the controller again, or restart the kafka3 node, everything goes back to the previous
situation (all the errors are logged again), and at this point I don't even know how to recover.
The only "fix" I have left to try is to wipe the whole cluster's data and restart :( but what
do I do then if this happens again in the future?

I don't know why two nodes seem to have a ("broken controller"??) and the cluster remains
in this inconsistent state forever.
If you have any suggestions on what to inspect or how to try to fix this, they are very welcome.

Thank you,
Francesco



> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, we started
seeing a large number of under-replicated partitions. The zookeeper cluster recovered, however
we continued to see a large number of under-replicated partitions. Two brokers in the kafka
cluster were showing this in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker
5: Shrinking ISR for partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker
5: Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This happened for all of the topics on the affected brokers. Both brokers only recovered after a restart.
Our own investigation yielded nothing, so I was hoping you could shed some light on this issue.
Possibly it's related to https://issues.apache.org/jira/browse/KAFKA-1382 , however we're
using 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
