From: Joel Koshy
Date: Wed, 3 May 2017 11:52:04 -0700
Subject: Re: [DISCUSS] KIP-143: Controller Health Metrics
To: Onur Karaman
Cc: "dev@kafka.apache.org", Onur Karaman

On Wed, May 3, 2017 at 10:54 AM, Onur Karaman wrote:

> Regarding the ControllerState and the potential for overlap, I think it
> depends on our definition of controller state. While KAFKA-5028 allows only
> a single ControllerEvent to be processed at a time, it still allows
> interleavings for long-lasting actions like partition reassignment and
> topic deletion. For instance, a topic can get created while another topic
> is undergoing partition reassignment. In that sense, there is overlap.
> However, in the sense of the ControllerEvents being processed, there can be
> no overlap.
>

Yes - that is roughly what I was thinking (although deletes are no longer
long running). Also, what is the "steady-state" controller state? Idle?
What about a broker that is not the controller? Would you need a separate
idle-not-controller state? Given that most of the state changes are short
we would just see blips in the best case and nothing in the worst case
(depending on how often metrics get sampled). It would only help if you
want to visually detect any transitions that are taking an inordinate
duration.

>> > 1. Yes, the long term goal is to migrate the metrics on the broker to
>> > kafka-metrics. Since many people are using Yammer reporters, we probably
>> > need to support a few popular ones in kafka-metrics before migrating. Until
>> > that happens, we probably want to stick with the Yammer metrics on the
>> > server side unless we depend on features from kafka-metrics (e.g., quota).
>> >

Ok - my thought was since we are already using kafka-metrics for quotas and
selector metrics we could just do the same for this (and any *new* metrics
on the broker).

>> > 4. Metrics #2 and #3.
>> > The issue with relying on metric #1 is that the
>> > latter is sensitive to the frequency of metric collection. For example, if
>> > the starting of the controller takes 30 secs and the metric is only
>> > collected once a minute, one may not know the latency with just metric #1,
>> > but will know the latency with metrics #2 and #3. Are you concerned about
>> > the memory overhead of histograms? It doesn't seem that a couple more
>> > histograms will hurt.
>> >

No I don't have concerns about the histograms - just wondering if it is
useful enough to have these in the first place, but your summary makes
sense.

Joel

>> > Hi, Ismael,
>> >
>> > Thanks for the proposal. A couple more comments.
>> >
>> > 10. It would be useful to add a new metric for the controller queue size.
>> > kafka.controller:type=ControllerStats,name=QueueSize
>> >
>> > 11. It would also be useful to know how long an event is waiting in the
>> > controller queue before being processed. Perhaps we can add a histogram
>> > metric like the following.
>> > kafka.controller:type=ControllerStats,name=QueueTimeMs
>> >
>> > Jun
>> >
>> > On Thu, Apr 27, 2017 at 11:39 AM, Joel Koshy wrote:
>> >
>> > > Thanks for the KIP - couple of comments:
>> > > - Do you intend to actually use yammer metrics? or use kafka-metrics and
>> > > split the timer into an explicit rate and time? I think long term we ought
>> > > to move off yammer and use kafka-metrics only. Actually either is fine, but
>> > > we should ideally use only one in the long term - and I thought the plan
>> > > was to use kafka-metrics.
>> > > - metric #9 appears to be redundant since we already have per-API request
>> > > rate and time metrics.
>> > > - Same for metric #4, #5 (as there are request stats for
>> > > DeleteTopicRequest - although it is possible for users to trigger deletes
>> > > via ZK)
>> > > - metric #2, #3 are potentially useful, but a bit overkill for a
>> > > histogram.
>> > > Alternative is to stick to last known value, but that doesn't
>> > > play well with alerts if a high value isn't reset/decayed. Perhaps metric
>> > > #1 would be sufficient to gauge slow start/resignation transitions.
>> > > - metric #1 - some of the states may actually overlap
>> > > - I don't actually understand the semantics of metric #6. Is it rate of
>> > > partition reassignment triggers? does the number of partitions matter?
>> > >
>> > > Joel
>> > >
>> > > On Thu, Apr 27, 2017 at 8:04 AM, Tom Crayford wrote:
>> > >
>> > >> Ismael,
>> > >>
>> > >> Great, that sounds lovely.
>> > >>
>> > >> I'd like a `Timer` (using yammer metrics parlance) over how long it took to
>> > >> process the event, so we can get at p99 and max times spent processing
>> > >> things. Maybe we could even do a log at warning level if event processing
>> > >> takes over some timeout?
>> > >>
>> > >> Thanks
>> > >>
>> > >> Tom
>> > >>
>> > >> On Thu, Apr 27, 2017 at 3:59 PM, Ismael Juma wrote:
>> > >>
>> > >> > Hi Tom,
>> > >> >
>> > >> > Yes, the plan is to merge KAFKA-5028 first and then use a lock-free
>> > >> > approach for the new metrics. I considered mentioning that in the KIP
>> > >> > given KAFKA-5120, but didn't in the end. I'll add it to make it clear.
>> > >> >
>> > >> > Regarding locks, they are removed by KAFKA-5028, as you say. So, if I
>> > >> > understand correctly, you are suggesting an event processing rate metric
>> > >> > with event type as a tag? Onur and Jun, what do you think?
>> > >> >
>> > >> > Ismael
>> > >> >
>> > >> > On Thu, Apr 27, 2017 at 3:47 PM, Tom Crayford <tcrayford@heroku.com>
>> > >> > wrote:
>> > >> >
>> > >> > > Hi,
>> > >> > >
>> > >> > > We (Heroku) are very excited about this KIP, as we've struggled a bit with
>> > >> > > controller stability recently. Having these additional metrics would be
>> > >> > > wonderful.
>> > >> > >
>> > >> > > I'd like to ensure polling these metrics *doesn't* hold any locks etc,
>> > >> > > because, as noted in https://issues.apache.org/jira/browse/KAFKA-5120,
>> > >> > > that lock can be held for quite some time. This may become not an issue
>> > >> > > as of KAFKA-5028 though.
>> > >> > >
>> > >> > > Lastly, I'd love to see some metrics around how long the controller spends
>> > >> > > inside its lock. We've been tracking an issue (
>> > >> > > https://issues.apache.org/jira/browse/KAFKA-5116) where it can hold the
>> > >> > > lock for many, many minutes in a zk client listener thread when responding
>> > >> > > to a single request. I'm not sure how that plays into
>> > >> > > https://issues.apache.org/jira/browse/KAFKA-5028 (which I assume will land
>> > >> > > before this metrics patch), but it feels like there will be equivalent
>> > >> > > problems ("how long does it spend processing any individual message from
>> > >> > > the queue, broken down by message type").
>> > >> > >
>> > >> > > These are minor improvements though, the addition of more metrics to the
>> > >> > > controller is already going to be very helpful.
>> > >> > >
>> > >> > > Thanks
>> > >> > >
>> > >> > > Tom Crayford
>> > >> > > Heroku Kafka
>> > >> > >
>> > >> > > On Thu, Apr 27, 2017 at 3:10 PM, Ismael Juma wrote:
>> > >> > >
>> > >> > > > Hi all,
>> > >> > > >
>> > >> > > > We've posted "KIP-143: Controller Health Metrics" for discussion:
>> > >> > > >
>> > >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-143%3A+Controller+Health+Metrics
>> > >> > > >
>> > >> > > > Please take a look. Your feedback is appreciated.
>> > >> > > >
>> > >> > > > Thanks,
>> > >> > > > Ismael
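
Illustration (not part of the original message): the thread proposes a queue-size gauge (QueueSize), a queue-wait histogram (QueueTimeMs), and per-event-type processing times. The Python sketch below shows roughly how those three measurements relate for a single-threaded event queue. It is not Kafka code. The class ControllerQueueSim, the event names, and the simulated clock are all hypothetical, and plain lists stand in for real histograms.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class QueuedEvent:
    name: str            # hypothetical event type, e.g. "TopicChange"
    cost_ms: int         # simulated processing time for this event
    enqueued_at_ms: int  # simulated clock reading at enqueue time


@dataclass
class ControllerQueueSim:
    """Toy single-threaded event queue with a simulated clock (assumption)."""
    now_ms: int = 0
    queue: deque = field(default_factory=deque)
    queue_time_ms: list = field(default_factory=list)  # stand-in for a QueueTimeMs histogram
    processing_ms: dict = field(default_factory=dict)  # per-event-type processing times

    def enqueue(self, name, cost_ms):
        self.queue.append(QueuedEvent(name, cost_ms, self.now_ms))

    def queue_size(self):
        # Stand-in for a QueueSize gauge: read on demand, no lock needed here.
        return len(self.queue)

    def process_next(self):
        event = self.queue.popleft()
        # Time the event spent waiting before processing started.
        self.queue_time_ms.append(self.now_ms - event.enqueued_at_ms)
        # Only one event is processed at a time, so the clock simply advances.
        self.now_ms += event.cost_ms
        self.processing_ms.setdefault(event.name, []).append(event.cost_ms)
        return event.name


sim = ControllerQueueSim()
sim.enqueue("PartitionReassignment", 30_000)  # one slow event at the head
sim.enqueue("TopicChange", 5)
sim.enqueue("TopicChange", 7)
size_before = sim.queue_size()
while sim.queue:
    sim.process_next()

print(size_before, sim.queue_time_ms, sim.processing_ms)
```

In this toy run the two short events each wait about 30 seconds behind the slow one. A gauge sampled once a minute could miss that entirely, whereas the recorded wait and processing times keep it visible, which is the argument made for histograms earlier in the thread.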