Date: Mon, 22 May 2017 12:50:04 +0000 (UTC)
From: "dhiraj prajapati (JIRA)"
To: dev@kafka.apache.org
Subject: [jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

    [ https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019536#comment-16019536 ]

dhiraj prajapati edited comment on KAFKA-4477 at 5/22/17 12:49 PM:
-------------------------------------------------------------------

Hi all,
We have a 3-node cluster in our production environment. We recently upgraded Kafka from 0.9.0.1 to 0.10.1.0 and we are seeing a similar issue of intermittent disconnections. We never had this issue on 0.9.0.1. Is this issue fixed in later versions?
I am asking because I saw a similar thread for version 0.10.2: https://issues.apache.org/jira/browse/KAFKA-5153
Please assist.


> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4477
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4477
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.1.0
>         Environment: RHEL7
>                      java version "1.8.0_66"
>                      Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
>                      Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>            Reporter: Michael Andre Pearce
>            Assignee: Apurva Mehta
>            Priority: Critical
>              Labels: reliability
>             Fix For: 0.10.1.1
>
>         Attachments: 2016_12_15.zip, 72_Server_Thread_Dump.txt, 73_Server_Thread_Dump.txt, 74_Server_Thread_Dump, issue_node_1001_ext.log, issue_node_1001.log, issue_node_1002_ext.log, issue_node_1002.log, issue_node_1003_ext.log, issue_node_1003.log, kafka.jstack, server_1_72server.log, server_2_73_server.log, server_3_74Server.log, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has recurred in different physical environments. We haven't worked out what is going on, but we do have a nasty workaround to keep the service alive.
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs of the partitions it leads down to just itself; moments later we see other nodes having disconnects, followed finally by application issues, where producing to these partitions is blocked.
> It seems that only restarting the Kafka broker's Java process resolves the issue.
> We have had this occur multiple times, and according to all network and machine monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was read
> All clients:
> java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
> After this occurs, we then see a growing number of CLOSE_WAIT connections and open file descriptors on the sick machine.
> As a workaround to keep the service up, we are currently putting in an automated process that tails the broker log and matches it against the regex below; where the captured new_partitions group is just the node itself, we restart the node:
> "\[(?P
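The regex in the quoted report is truncated above. As an illustration only, a minimal sketch of that kind of log watcher in Python might look like the following; the broker id, log path, restart command, and the regex itself are assumptions chosen to match the sample "Shrinking ISR" log line, not details taken from the report:

#!/usr/bin/env python3
# Hypothetical sketch of the log-watching workaround described in the quoted
# report: tail the broker's server.log, match "Shrinking ISR" lines, and
# restart the broker when it shrinks an ISR down to only itself.
# BROKER_ID, LOG_PATH, the restart command, and the regex are assumptions.
import re
import subprocess
import time

BROKER_ID = "7"                          # assumed id of the local broker
LOG_PATH = "/var/log/kafka/server.log"   # assumed broker log location

# Matches lines like:
# ... on broker 7: Shrinking ISR for partition [topic,10] from 1,2,7 to 7 ...
SHRINK_RE = re.compile(
    r"on broker (?P<broker>\d+): Shrinking ISR for partition "
    r"\[(?P<partition>[^\]]+)\] from (?P<old_partitions>[\d,]+) "
    r"to (?P<new_partitions>[\d,]+)"
)

def follow(path):
    """Yield lines appended to the file, like a minimal `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)                     # start at the end of the file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(0.5)

for line in follow(LOG_PATH):
    m = SHRINK_RE.search(line)
    if m and m.group("new_partitions") == BROKER_ID:
        # The ISR collapsed to this broker alone -- the condition the report
        # works around by bouncing the broker process.
        subprocess.call(["systemctl", "restart", "kafka"])   # assumed restart mechanism
        break

Restarting the broker is only the stop-gap the report describes; per the "Fix For" field above, the underlying issue was addressed in 0.10.1.1.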