Date: Fri, 25 Aug 2017 20:37:00 +0000 (UTC)
From: "Jeff Chao (JIRA)"
To: jira@kafka.apache.org
Subject: [jira] [Resolved] (KAFKA-5452) Aggressive log compaction ratio appears to have no negative effect on log-compacted topics

     [ https://issues.apache.org/jira/browse/KAFKA-5452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Chao resolved KAFKA-5452.
------------------------------
    Resolution: Resolved

Following up after a long while. After talking offline with [~wushujames], the original thought was to choose a sensible default in relation to disk I/O. I think it's best to leave the default as is and avoid baking assumptions about the underlying infrastructure into it. That way, operators are free to tune it to their own workloads and expectations.

Closing this.
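For operators who do want a more aggressive setting on a specific topic, here is a minimal sketch of a per-topic override using the Java AdminClient (available in 0.11.0+). The broker address and topic name are placeholders, not values from this ticket:

{code:java}
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collections;
import java.util.Properties;

public class SetDirtyRatio {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; replace with a real broker.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical compacted topic; the 0.5 broker-wide default stays untouched.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-compacted-topic");
            Config overrides = new Config(Collections.singletonList(
                    new ConfigEntry("min.cleanable.dirty.ratio", "0.1")));
            // Note: alterConfigs replaces the topic's full set of dynamic overrides,
            // so include any other per-topic configs you want to keep.
            admin.alterConfigs(Collections.singletonMap(topic, overrides)).all().get();
        }
    }
}
{code}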
> Aggressive log compaction ratio appears to have no negative effect on log-compacted topics
> ------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-5452
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5452
>             Project: Kafka
>          Issue Type: Improvement
>          Components: config, core, log
>    Affects Versions: 0.10.2.0, 0.10.2.1
>         Environment: Ubuntu Trusty (14.04.5), Oracle JDK 8
>            Reporter: Jeff Chao
>              Labels: performance
>         Attachments: 200mbs-dirty0-dirty-1-dirty05.png, flame-graph-200mbs-dirty0.png, flame-graph-200mbs-dirty0.svg
>
>
> Some of our users are seeing unintuitive/unexpected behavior with log-compacted topics where they receive multiple records for the same key when consuming. This is a result of low throughput on log-compacted topics, such that the conditions ({{min.cleanable.dirty.ratio = 0.5}}, the default) aren't met for compaction to kick in.
> This prompted us to test and tune {{min.cleanable.dirty.ratio}} in our clusters. It appears that more aggressive log compaction ratios don't have negative effects on CPU and memory utilization. If this is truly the case, we should consider changing the default from {{0.5}} to something more aggressive.
> Setup:
> # 8 brokers
> # 5 zk nodes
> # 32 partitions on a topic
> # replication factor 3
> # log roll 3 hours
> # log segment bytes 1 GB
> # log retention 24 hours
> # all messages to a single key
> # all messages to a unique key
> # all messages to a bounded key range [0, 999]
> # {{min.cleanable.dirty.ratio}} per topic = {{0}}, {{0.5}}, and {{1}}
> # 200 MB/s sustained, produce and consume traffic
> Observations:
> We were able to verify log cleaner threads were performing work by checking the logs and verifying the {{cleaner-offset-checkpoint}} file for all topics. We also observed that the log cleaner's {{time-since-last-run-ms}} metric was normal, never going above the default of 15 seconds.
> Under-replicated partitions stayed steady, same for replication lag.
> Here's an example test run where we try out {{min.cleanable.dirty.ratio = 0}}, {{min.cleanable.dirty.ratio = 1}}, and {{min.cleanable.dirty.ratio = 0.5}}. Troughs in between the peaks represent zero traffic and reconfiguring of topics.
> (200mbs-dirty0-dirty-1-dirty05.png attached)
> !200mbs-dirty0-dirty-1-dirty05.png|thumbnail!
> Memory utilization is fine, but more interestingly, CPU doesn't appear to show much difference.
> To get more detail, here is a flame graph (raw svg attached) of the run for {{min.cleanable.dirty.ratio = 0}}. The conservative and default ratio flame graphs are equivalent.
> (flame-graph-200mbs-dirty0.png attached)
> !flame-graph-200mbs-dirty0.png|thumbnail!
> Notice that the majority of CPU is coming from:
> # SSL operations (on reads/writes)
> # KafkaApis::handleFetchRequest (ReplicaManager::fetchMessages)
> # KafkaApis::handleOffsetFetchRequest
> We also have examples from small-scale test runs which show similar behavior, but with scaled-down CPU usage.
> It seems counterintuitive that there's no apparent difference in CPU between aggressive and conservative compaction ratios, so we'd like to get some thoughts from the community.
> We're looking for feedback on whether anyone else has experienced this behavior before or, if CPU isn't affected, whether anyone has seen something related instead.
> If this is true, then we'd be happy to discuss further and provide a patch.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
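As an aside on the consumer-visible symptom described above: until the cleaner's dirty ratio threshold is crossed, a compacted topic can still hold several records for the same key, so readers have to keep only the latest value themselves. A minimal sketch with the Java consumer, assuming a 2.0+ client for {{poll(Duration)}}; the broker address, group id, and topic name are placeholders:

{code:java}
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class LatestValuePerKey {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "compaction-illustration");   // hypothetical group id
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        // Last write wins per key, which is what compaction eventually enforces on disk.
        Map<String, String> latest = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-compacted-topic")); // hypothetical topic
            // A real reader would poll in a loop until caught up; one poll keeps the sketch short.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                latest.put(record.key(), record.value());
            }
        }
        System.out.println("Distinct keys seen: " + latest.size());
    }
}
{code}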