Date: Wed, 6 Sep 2017 10:29:01 +0000 (UTC)
From: "Viktor Somogyi (JIRA)"
To: jira@kafka.apache.org
Subject: [jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

    [ https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155156#comment-16155156 ]

Viktor Somogyi commented on KAFKA-4477:
---------------------------------------

Could anyone help me by sharing the commit hash related to this JIRA? I'd like to look into the fix, but I couldn't find any related commits in the git history.
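(For reference, commits tied to a JIRA key can usually be found with a history search over a local clone of apache/kafka; a minimal command-line example, not taken from the original thread:

    git log --all --oneline --grep='KAFKA-4477'

The --all flag also searches release branches, where backported fixes sometimes land without appearing on trunk.)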
> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4477
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4477
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.1.0
>         Environment: RHEL7
>                      java version "1.8.0_66"
>                      Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
>                      Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>            Reporter: Michael Andre Pearce
>            Assignee: Apurva Mehta
>            Priority: Critical
>              Labels: reliability
>             Fix For: 0.10.1.1
>
>         Attachments: 2016_12_15.zip, 72_Server_Thread_Dump.txt, 73_Server_Thread_Dump.txt, 74_Server_Thread_Dump, issue_node_1001_ext.log, issue_node_1001.log, issue_node_1002_ext.log, issue_node_1002.log, issue_node_1003_ext.log, issue_node_1003.log, kafka.jstack, server_1_72server.log, server_2_73_server.log, server_3_74Server.log, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has recurred in different physical environments. We haven't worked out what is going on, but we do have a nasty workaround to keep the service alive.
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it leads down to itself; moments later we see other nodes having disconnects, followed finally by application issues, where producing to these partitions is blocked.
> It seems that only restarting the Kafka Java process resolves the issue.
> We have had this occur multiple times, and from all network and machine monitoring the machine never left the network or had any other glitches.
> Below are logs from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was read
> All clients:
> java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing number of CLOSE_WAIT sockets and open file descriptors.
> As a workaround to keep the service alive, we are currently putting in an automated process that tails the logs and matches the regex below; where new_partitions is just the node itself, we restart the node (a sketch of such a watchdog follows after this quoted report):
> "\[(?P
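The tail-and-restart workaround the reporter describes could look roughly like the sketch below. This is a minimal illustration, not the reporter's actual script: the log path, broker id, and restart command are placeholder assumptions, and the regex only approximates the pattern that is truncated at the end of the quoted report.

#!/usr/bin/env python
# Minimal sketch of the tail-and-restart watchdog described above.
# Assumptions (not from the original report): the log path, broker id,
# and restart command are placeholders; the regex approximates the
# truncated pattern from the report.
import os
import re
import subprocess
import time

BROKER_ID = "7"                                  # this node's broker id
LOG_PATH = "/var/log/kafka/server.log"           # placeholder path
RESTART_CMD = ["systemctl", "restart", "kafka"]  # placeholder command

# Matches e.g.: "Shrinking ISR for partition [topic,10] from 1,2,7 to 7"
SHRINK_RE = re.compile(
    r"Shrinking ISR for partition \[.+?\] from [\d,]+ to (?P<new_isr>[\d,]+)"
)

def follow(path):
    """Yield lines appended to the file, like `tail -f`."""
    with open(path) as f:
        f.seek(0, os.SEEK_END)  # start at the end; only watch new lines
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line

for line in follow(LOG_PATH):
    m = SHRINK_RE.search(line)
    # Restart only when the new ISR has collapsed to this node alone.
    if m and m.group("new_isr") == BROKER_ID:
        subprocess.check_call(RESTART_CMD)
        break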