Mailing-List: contact dev-help@mesos.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@mesos.apache.org
MIME-Version: 1.0
In-Reply-To: <CAFt=ROPYEoSRDxdQdBm7_mAanCwxJkuQB3s-TrSjZPyfPQ1MnQ@mail.gmail.com>
References: <CANMwFmV-vMxK=eHHmEAO_ZrDwRY6AOG=ygWBRW7DMBUTm_KsMg@mail.gmail.com>
 <CAFt=ROPYEoSRDxdQdBm7_mAanCwxJkuQB3s-TrSjZPyfPQ1MnQ@mail.gmail.com>
From: Zameer Manji <zmanji@apache.org>
Date: Mon, 19 Dec 2016 21:32:23 -0500
Message-ID: <CAM+cpfdoHAP3Ldc6ZtSLUHnd-n27SDNr5ehLVR6JiVi4y1ktTg@mail.gmail.com>
Subject: Re: Metrics collection affected when libprocess queue builds up
To: mesos <dev@mesos.apache.org>
Content-Type: multipart/alternative; boundary=f403045c64a6dbd12605440dd970
archived-at: Tue, 20 Dec 2016 02:32:47 -0000

--f403045c64a6dbd12605440dd970
Content-Type: text/plain; charset=UTF-8

I believe Zhitao is referring to `/metrics/snapshot` taking 10-30 seconds
to return a result.

I think in a typical environment, this will cause most metrics collection
tooling to time out. That leaves the operator without any visibility into
the system, which makes debugging or fighting the problem very hard.
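
To make the failure mode concrete, here is a minimal Python sketch of what
a collector effectively does: fetch the endpoint with a deadline and record
how long it took. The function name `timed_snapshot` and the 5-second
default timeout are my own placeholders, not anything taken from Mesos or
from a particular collection tool.

```python
import http.client
import json
import time
import urllib.request


def timed_snapshot(url, timeout_secs=5.0):
    """Fetch a metrics snapshot and return (elapsed_secs, metrics).

    `metrics` is None when the request fails, times out, or does not
    return valid JSON. Hypothetical helper for illustration only.
    """
    start = time.monotonic()
    try:
        # The timeout applies to each blocking socket operation
        # (connect, read), not to the request as a whole.
        with urllib.request.urlopen(url, timeout=timeout_secs) as resp:
            metrics = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and socket timeouts are OSError subclasses;
        # ValueError covers malformed JSON or undecodable bytes.
        metrics = None
    return time.monotonic() - start, metrics
```

Calling `timed_snapshot("http://<master-host>:5050/metrics/snapshot")`
(assuming the master listens on its default port, 5050) would give back
`None` for the metrics whenever the master stays silent for longer than the
timeout, which is what a collector would see in the 10-30 second case.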

On Mon, Dec 19, 2016 at 9:23 PM, haosdent <haosdent@gmail.com> wrote:

> Hi, @zhitao
>
> > the `/metrics/snapshot` could take 10-30 seconds to respond.
>
> Do you mean that `/metrics/snapshot` returns a result after 10~30 seconds?
> Or that `/metrics/snapshot` takes 10~30 seconds to reflect a change in the
> `allocator/mesos/event_queue_dispatches` gauge?
>
> On Mon, Dec 19, 2016 at 1:11 PM, Zhitao Li <zhitaoli.cs@gmail.com> wrote:
>
> > Hi all,
> >
> > While I was debugging an allocator message queue build-up issue on the
> > master (which I plan to share in another thread), I noticed that
> > `/metrics/snapshot` is also badly affected.
> >
> > For example, when the allocator queue has ~3k dispatches in it (revealed
> > by the `allocator/mesos/event_queue_dispatches` gauge), the
> > `/metrics/snapshot` could take 10-30 seconds to respond.
> >
> > During active debugging or while fighting an outage, this is highly
> > undesirable.
> >
> > My guess is that much of the stats collection code relies on *deferring*
> > to another libprocess process and collecting the result.
> >
> > Should we explore a more reliable way to track metrics independently from
> > libprocess's queue?
> >
> > --
> > Cheers,
> >
> > Zhitao Li
> >
>
>
>
> --
> Best Regards,
> Haosdent Huang
>
> --
> Zameer Manji
>

--f403045c64a6dbd12605440dd970--