From: Zameer Manji
Date: Mon, 19 Dec 2016 21:32:23 -0500
Subject: Re: Metrics collection affected when libprocess queue builds up
To: mesos

I believe Zhitao is referring to `/metrics/snapshot` returning a result only
after 10-30 seconds. I think in a typical environment, this will cause most
metrics collection tooling to time out. The operator then has no visibility
into the system, which makes debugging or fighting the problem very hard.

On Mon, Dec 19, 2016 at 9:23 PM, haosdent wrote:

> Hi, @zhitao
>
> > the `/metrics/snapshot` could take 10-30 seconds to respond.
>
> Do you mean that `/metrics/snapshot` returns a result after 10~30 seconds?
> Or that `/metrics/snapshot` takes 10~30 seconds to reflect the change of
> the `allocator/mesos/event_queue_dispatches` gauge?
>
> On Mon, Dec 19, 2016 at 1:11 PM, Zhitao Li wrote:
>
> > Hi all,
> >
> > While I was debugging an allocator message queue build-up issue on master
> > (which I plan to share in another thread), I noticed that
> > `/metrics/snapshot` is also badly affected.
> >
> > For example, when the allocator queue has ~3k dispatches in it (revealed
> > by the allocator/mesos/event_queue_dispatches gauge), the
> > `/metrics/snapshot` could take 10-30 seconds to respond.
> >
> > During active debugging or outage fighting, this is pretty undesirable.
> >
> > My guess is that much of the stats collection code relies on *deferring*
> > to another libprocess and collecting the result.
> >
> > Should we explore a more reliable way to track metrics independently from
> > libprocess's queue?
> >
> > --
> > Cheers,
> >
> > Zhitao Li
> >
>
> --
> Best Regards,
> Haosdent Huang
>

--
Zameer Manji
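
To illustrate Zhitao's guess above, here is a toy Python model, not libprocess code. It assumes a gauge whose value is computed by dispatching onto a single-threaded actor's FIFO mailbox, and compares that with a value read directly, outside the mailbox. The `Actor` class, `dispatch`, `run_one`, and `events_until_ready` are hypothetical names invented for this sketch; the backlog of 3000 mirrors the "~3k dispatches" figure from the thread.

```python
from collections import deque


class Actor:
    """Toy single-threaded actor: events run strictly in FIFO order."""

    def __init__(self):
        self.mailbox = deque()

    def dispatch(self, fn):
        """Enqueue fn; return a list that will receive its result."""
        result = []
        self.mailbox.append((fn, result))
        return result

    def run_one(self):
        """Process exactly one queued event."""
        fn, result = self.mailbox.popleft()
        result.append(fn())


def events_until_ready(actor, result):
    """Drain the mailbox until result is filled; return how many events ran."""
    ran = 0
    while not result:
        actor.run_one()
        ran += 1
    return ran


actor = Actor()
for _ in range(3000):  # backlog, like ~3k queued allocator dispatches
    actor.dispatch(lambda: None)

# Independent read: the queue depth is available immediately.
direct_depth = len(actor.mailbox)

# Deferred gauge: evaluated inside the actor, so it waits behind the backlog.
gauge = actor.dispatch(lambda: len(actor.mailbox))
waited = events_until_ready(actor, gauge)

print(direct_depth)  # depth seen without going through the mailbox
print(waited)        # events processed before the deferred gauge resolved
```

In this model the deferred gauge only resolves once every earlier event has run, and by then it reports an empty queue, while the direct read returns the backlog size at once. Whether Mesos metrics behave this way in a given version is exactly the open question in the thread.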