Date: Mon, 12 Jun 2017 13:04:01 +0000 (UTC)
From: "Bart Vercammen (JIRA)"
To: dev@kafka.apache.org
Reply-To: dev@kafka.apache.org
Subject: [jira] [Commented] (KAFKA-5153) KAFKA Cluster : 0.10.2.0 : Servers Getting disconnected : Service Impacting

    [ https://issues.apache.org/jira/browse/KAFKA-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046534#comment-16046534 ]

Bart Vercammen commented on KAFKA-5153:
---------------------------------------

[~arpan.khagram0212@gmail.com] Can you confirm that changing the default config fixed the issue for you?
{noformat}
replica.fetch.wait.max.ms
replica.lag.time.max.ms
{noformat}
We're also hitting this issue continuously on our clusters running Kafka 0.10.1.1, and we have also encountered it (or at least something with the same symptoms) on our Kafka 0.10.2.1 clusters.
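For reference, a minimal sketch of how these two settings could look in server.properties, assuming the 0.10.x defaults of 500 ms and 10000 ms respectively; the raised lag window below is only an illustrative mitigation, not something we have validated:
{noformat}
# server.properties (broker side) -- illustrative values only, not a validated fix
# Max time a follower's fetch request blocks on the leader waiting for data
# (0.10.x default: 500 ms)
replica.fetch.wait.max.ms=500
# If a follower has not fetched, or has not caught up to the leader's log end offset,
# within this window, the leader drops it from the ISR (0.10.x default: 10000 ms)
replica.lag.time.max.ms=30000
{noformat}
The usual guidance is to keep replica.fetch.wait.max.ms well below replica.lag.time.max.ms, so that a blocked fetch request alone cannot push a follower out of the ISR.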
I still need to investigate in more detail what is actually triggering this, so any tips or insights would be welcome ...

> KAFKA Cluster : 0.10.2.0 : Servers Getting disconnected : Service Impacting
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-5153
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5153
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.10.2.0
>        Environment: RHEL 6
> Java Version 1.8.0_91-b14
>           Reporter: Arpan
>           Priority: Critical
>        Attachments: server_1_72server.log, server_2_73_server.log, server_3_74Server.log, server.properties, ThreadDump_1493564142.dump, ThreadDump_1493564177.dump, ThreadDump_1493564249.dump
>
>
> Hi Team,
> I was earlier referring to issue KAFKA-4477 because the problem I am facing is similar. I tried to find the same reference in the release docs as well but did not see anything for 0.10.1.1 or 0.10.2.0. I am currently using 2.11_0.10.2.0.
> I have a 3-node cluster for KAFKA and a ZK cluster as well, running on the same set of servers in cluster mode. Around 240 GB of data is transferred through KAFKA every day. What we observe is a server getting disconnected from the cluster and the ISR shrinking, which starts impacting the service.
> I have also observed the file descriptor count increasing a bit: in normal circumstances we have not seen an FD count above 500, but when the issue started we were seeing it in the range of 650-700 on all 3 servers. Attaching thread dumps of all 3 servers from when we started facing the issue recently.
> The issue vanishes once you bounce the nodes, but the setup does not run for more than 5 days without the issue recurring. Attaching server logs as well.
> Kindly let me know if you need any additional information. Attaching server.properties as well for one of the servers (it's similar on all 3 servers).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)