Date: Mon, 20 Nov 2017 13:00:00 +0000 (UTC)
From: "Robin Tweedie (JIRA)"
To: jira@kafka.apache.org
Subject: [jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

    [ https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259208#comment-16259208 ]

Robin Tweedie commented on KAFKA-6199:
--------------------------------------

I think I found the corresponding "old client" causing this problem this morning -- might help with reproducing the issue.
It is logging a similar error at around the same rate as we see the WARN messages on the Kafka broker:
{noformat}
[2017-11-20 10:20:17,495] ERROR kafka:102 Unable to receive data from Kafka
Traceback (most recent call last):
  File "/opt/kafka_offset_manager/venv/lib/python2.7/site-packages/kafka/conn.py", line 99, in _read_bytes
    raise socket.error("Not enough data to read message -- did server kill socket?")
error: Not enough data to read message -- did server kill socket?
{noformat}
The client is a Python 2.7 process that checks topic and consumer offsets so we can report metrics. It was running {{kafka-python==0.9.3}} (the current release is 1.3.5). We are going to run some experiments to confirm that this process is the culprit of the heap growth; a rough sketch of the checker loop is at the end of this message.


> Single broker with fast growing heap usage
> ------------------------------------------
>
>                 Key: KAFKA-6199
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6199
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.10.2.1
>         Environment: Amazon Linux
>            Reporter: Robin Tweedie
>         Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, merge_shortest_paths.png, path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast-growing heap usage that forces us to restart it every 12 hours. If we don't restart the broker, it becomes very slow from long GC pauses and eventually hits {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage percentage on the broker. A "normal" broker in the same cluster stays below 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously high, and they show a lot of retained {{NetworkSend}} objects referencing byte buffers.
> We also noticed that the single affected broker logs this kind of warning far more often than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log message across all the brokers (it happens occasionally on other brokers, but not nearly as much as on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin down the problem better. We're currently trying to narrow it down to a particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source of this leak.
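
A minimal sketch of that checker loop, for anyone who wants to try reproducing against 0.10.2.1. This is not the real script: the {{KafkaClient}} / {{OffsetRequest}} / {{send_offset_request}} names are assumed from the old kafka-python 0.9.x interface, and the broker address, topic and interval are placeholders.
{noformat}
# Rough sketch only -- assumes the kafka-python==0.9.3 API; KafkaClient,
# OffsetRequest and send_offset_request are assumed names from that version,
# and the broker address, topic and sleep interval are placeholders.
import time

from kafka import KafkaClient
from kafka.common import OffsetRequest

client = KafkaClient("broker-01:9092")  # placeholder broker address

while True:
    # Ask partition 0 of a placeholder topic for its latest offset
    # (time=-1 means "latest", max_offsets=1).
    responses = client.send_offset_request([OffsetRequest("some-topic", 0, -1, 1)])
    latest_offset = responses[0].offsets[0]
    print("some-topic partition 0 latest offset: %d" % latest_offset)

    time.sleep(60)
{noformat}
The real checker also fetches committed consumer offsets. If stopping or upgrading this process makes the broker's heap growth go away, that should confirm it as the culprit.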