Date: Mon, 26 Jun 2017 09:59:00 +0000 (UTC)
From: "Ismael Juma (JIRA)"
To: jira@kafka.apache.org
Subject: [jira] [Commented] (KAFKA-5512) KafkaConsumer: High memory allocation rate when idle

    [ https://issues.apache.org/jira/browse/KAFKA-5512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062859#comment-16062859 ]

Ismael Juma commented on KAFKA-5512:
------------------------------------

Nice catch. Are you intending to submit a PR with your fixes?

> KafkaConsumer: High memory allocation rate when idle
> ----------------------------------------------------
>
>                 Key: KAFKA-5512
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5512
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 0.10.1.1
>            Reporter: Stephane Roset
>              Labels: performance
>             Fix For: 0.11.0.1
>
> Hi,
> We noticed in our application that the memory allocation rate increased significantly when we had no Kafka messages to consume. We isolated the issue by using a JVM that simply runs 128 Kafka consumers. These consumers consume 128 partitions (so each consumer consumes one partition). The partitions are empty and no messages were sent during the test.
> The consumers were configured with default values (session.timeout.ms=30000, fetch.max.wait.ms=500, receive.buffer.bytes=65536, heartbeat.interval.ms=3000, max.poll.interval.ms=300000, max.poll.records=500). The Kafka cluster was made of 3 brokers. Within this context, the allocation rate was about 55 MiB/s. This high allocation rate generates a lot of GC activity (to collect the young generation) and was an issue for our project.
> We profiled the JVM with JProfiler. We noticed that there was a huge quantity of ArrayList$Itr instances in memory. These iterators were mainly instantiated by the methods handleCompletedReceives, handleCompletedSends, handleConnections and handleDisconnections of the class NetworkClient. We also noticed a lot of calls to the method pollOnce of the class KafkaConsumer.
> So we decided to run only one consumer and to profile the calls to the method pollOnce. We noticed that a huge number of calls is regularly made to this method, up to 268,000 calls within 100 ms. The pollOnce method calls the NetworkClient.handle* methods. These methods iterate over their collections (even when they are empty), which explains the huge number of iterators in memory.
> The large number of calls is related to the heartbeat mechanism. The pollOnce method calculates the poll timeout; if a heartbeat needs to be sent, the timeout is set to 0. The problem is that the heartbeat thread only checks every 100 ms (the default value of retry.backoff.ms) whether a heartbeat should be sent, so the KafkaConsumer calls poll in a loop with no timeout until the heartbeat thread wakes up. For example: the heartbeat thread has just started to wait and will wake in 99 ms. During those 99 ms, the KafkaConsumer calls pollOnce in a loop with a timeout of 0. That explains how we can have 268,000 calls within 100 ms.
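The first quick fix the reporter describes, skipping iteration over empty collections, can be sketched as follows. This is a minimal illustration, not the actual NetworkClient code: the class, field, and method names below are stand-ins chosen to mirror the report.

```java
import java.util.ArrayList;
import java.util.List;

public class HandleSketch {
    static int handled = 0;

    // Illustrative stand-in for a NetworkClient.handle* method: bail out
    // before the for-each loop when the list is empty, so no ArrayList$Itr
    // is allocated on the millions of idle pollOnce() calls.
    static void handleCompletedReceives(List<String> responses) {
        if (responses.isEmpty()) {
            return; // no iterator allocation for the common idle case
        }
        for (String response : responses) {
            handled++; // real code would dispatch each response here
        }
    }

    public static void main(String[] args) {
        List<String> empty = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            handleCompletedReceives(empty); // idle loop: zero iterator churn
        }
        handleCompletedReceives(List.of("r1", "r2"));
        System.out.println(handled); // prints 2
    }
}
```

The guard matters because a Java for-each loop over a List always allocates an Iterator object, even when the list is empty; at hundreds of thousands of calls per 100 ms, that garbage adds up to the reported allocation rate.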
> The heartbeat thread calls AbstractCoordinator.wait() to sleep, so I think the KafkaConsumer should wake the heartbeat thread with a notify when needed.
> We made two quick fixes to solve this issue:
> - In NetworkClient.handle*(), we don't iterate over collections if they are empty (to avoid unnecessary iterator instantiations).
> - In KafkaConsumer.pollOnce(), if the poll timeout is equal to 0, we notify the heartbeat thread to wake it (a dirty fix, because we don't handle the autocommit case).
> With these two quick fixes and 128 consumers, the allocation rate drops from 55 MiB/s to 4 MiB/s.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
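The wait/notify idea in the second quick fix can be sketched with a plain Java monitor. This is a simplified model, not the actual AbstractCoordinator implementation: the class and method names are hypothetical, and real code must also handle autocommit, shutdown, and spurious wakeups.

```java
public class HeartbeatWakeupSketch {
    // Stand-in for the coordinator lock the heartbeat thread waits on.
    private final Object lock = new Object();
    private boolean heartbeatDue = false; // guarded by lock

    // Heartbeat-thread side: bounded wait (retry.backoff.ms style) in a
    // loop, so a notify or the 100 ms backoff ends the sleep.
    void heartbeatLoop() throws InterruptedException {
        synchronized (lock) {
            while (!heartbeatDue) {
                lock.wait(100);
            }
            heartbeatDue = false;
            // ... a real implementation would send the heartbeat here ...
        }
    }

    // Consumer side: when pollOnce() computes a 0 timeout because a
    // heartbeat is due, notify the heartbeat thread instead of spinning.
    void wakeHeartbeatThread() {
        synchronized (lock) {
            heartbeatDue = true;
            lock.notify();
        }
    }

    static volatile String outcome;

    public static void main(String[] args) throws InterruptedException {
        HeartbeatWakeupSketch s = new HeartbeatWakeupSketch();
        Thread hb = new Thread(() -> {
            try { s.heartbeatLoop(); } catch (InterruptedException ignored) {}
        });
        hb.start();
        s.wakeHeartbeatThread(); // consumer wakes the heartbeat thread
        hb.join(1000);
        outcome = hb.isAlive() ? "still waiting" : "woken";
        System.out.println(outcome);
    }
}
```

Because the flag is set and checked under the same lock, the notify cannot be missed even if it happens before the heartbeat thread starts waiting, which is what lets the consumer hand off immediately instead of busy-calling pollOnce with a 0 timeout.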