kafka-jira mailing list archives

From Håkon Åmdal (JIRA) <j...@apache.org>
Subject [jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.
Date Thu, 22 Feb 2018 10:47:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372644#comment-16372644 ]

Håkon Åmdal commented on KAFKA-4477:

We ended up running our own build of Kafka 0.11.0 where we cherry-picked these commits:
 * KAFKA-6042: Avoid deadlock between two groups with delayed operations

 * KAFKA-6003; Accept appends on replicas and when rebuilding the log unconditionally

 * KAFKA-5970; Use ReentrantLock for delayed operation lock to avoid blocking

Off the top of my head, I cannot remember which commit exactly solved this problem, but we have
run without issues since November 2017.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership,
cluster remains sick until node is restarted.
> --------------------------------------------------------------------------------------------------------------------------------------
>                 Key: KAFKA-4477
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4477
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions:
>         Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>            Reporter: Michael Andre Pearce
>            Assignee: Apurva Mehta
>            Priority: Critical
>              Labels: reliability
>             Fix For:
>         Attachments: 2016_12_15.zip, 72_Server_Thread_Dump.txt, 73_Server_Thread_Dump.txt,
74_Server_Thread_Dump, issue_node_1001.log, issue_node_1001_ext.log, issue_node_1002.log,
issue_node_1002_ext.log, issue_node_1003.log, issue_node_1003_ext.log, kafka.jstack, server_1_72server.log,
server_2_73_server.log, server_3_74Server.log, state_change_controller.tar.gz
> We have encountered a critical issue that has recurred in different physical environments.
We haven't worked out what is going on. We do, though, have a nasty workaround to keep the service
available.
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISR for the partitions it owns down to
itself; moments later we see other nodes having disconnects, followed finally by application
issues, where producing to these partitions is blocked.
> It seems that only restarting the Kafka Java process resolves the issue.
> We have had this occur multiple times, and from all network and machine monitoring, the
machine never left the network or had any other glitches.
> Below are logs observed during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition [com_ig_trade_v1_position_event--demo--compacted,10]
on broker 7: Shrinking ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10]
from 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42
> java.io.IOException: Connection to 7 was disconnected before the response was read
> All clients:
> java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NetworkException:
The server disconnected before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing number of CLOSE_WAIT
connections and open file descriptors.
> As a workaround to keep the service available, we are currently putting in an automated process
that tails the logs and matches the regex below; when new_partitions is reduced to just the node
itself, we restart the node.
> "\[(?P<time>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for partition
\[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) \(kafka.cluster.Partition\)"
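The detection step of that workaround can be sketched as a small script. This is only an illustration, not the reporter's actual tooling: the function name `shrunk_to_self` is invented here, and the pattern is a slightly tightened version of the regex above (it captures the broker id so the new ISR can be compared against it).

```python
import re

# Matches Kafka's "Shrinking ISR" log line, as in the regex from the
# workaround above; the broker id is captured so we can compare it to
# the new ISR. (Pattern adapted for illustration.)
ISR_SHRINK_RE = re.compile(
    r"\[(?P<time>.+)\] INFO Partition \[.*\] on broker (?P<broker>\d+): "
    r"Shrinking ISR for partition \[.*\] "
    r"from (?P<old_isr>.+) to (?P<new_isr>.+) \(kafka\.cluster\.Partition\)"
)

def shrunk_to_self(log_line: str) -> bool:
    """Return True when a broker shrank the ISR down to only itself."""
    m = ISR_SHRINK_RE.search(log_line)
    if not m:
        return False
    # A "sick" broker is one whose new ISR is exactly its own id.
    return m.group("new_isr").strip() == m.group("broker")

# The log line from the report ("Node 7" above), joined onto one line:
line = ("[2016-12-01 07:01:28,112] INFO Partition "
        "[com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: "
        "Shrinking ISR for partition "
        "[com_ig_trade_v1_position_event--demo--compacted,10] "
        "from 1,2,7 to 7 (kafka.cluster.Partition)")
print(shrunk_to_self(line))  # True: ISR shrank from 1,2,7 to just broker 7
```

In the workaround described, a match like this would trigger an automated restart of the affected node.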

This message was sent by Atlassian JIRA
