mesos-dev mailing list archives

From Zameer Manji <zma...@apache.org>
Subject Re: Metrics collection affected when libprocess queue builds up
Date Tue, 20 Dec 2016 02:32:23 GMT
I believe Zhitao is referring to `/metrics/snapshot` returning a result
after 10-30 seconds.

I think in a typical environment, this will cause most metrics collection
tooling to time out. The operator then has no visibility into the system,
which makes debugging or fighting the problem very hard.
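
To make the suspected mechanism concrete (Zhitao's guess, quoted below, is
that gauges are answered by deferring to another libprocess process), here
is a toy Python sketch. It is not libprocess code: `SimulatedProcess` and
`deferred_gauge_latency_ms` are hypothetical names, and the 5 ms cost per
event is an assumption picked only so that a ~3k backlog lands in the
reported 10-30 second range.

```python
from collections import deque


class SimulatedProcess:
    """Toy single-threaded actor that handles mailbox events in FIFO order."""

    def __init__(self, cost_per_event_ms):
        self.mailbox = deque()
        self.cost_per_event_ms = cost_per_event_ms  # assumed, not measured
        self.clock_ms = 0  # virtual time, so the sketch is deterministic

    def dispatch(self, fn):
        """Enqueue an event; it runs only after everything queued before it."""
        self.mailbox.append(fn)

    def run_until_empty(self):
        while self.mailbox:
            fn = self.mailbox.popleft()
            self.clock_ms += self.cost_per_event_ms
            fn()


def deferred_gauge_latency_ms(backlog, cost_per_event_ms):
    """Virtual time until a gauge queued behind `backlog` events is answered."""
    proc = SimulatedProcess(cost_per_event_ms)
    for _ in range(backlog):
        proc.dispatch(lambda: None)  # stand-in for pending allocator events
    answered_at = []
    # The gauge read is just another event at the back of the same queue.
    proc.dispatch(lambda: answered_at.append(proc.clock_ms))
    proc.run_until_empty()
    return answered_at[0]


print(deferred_gauge_latency_ms(0, 5))     # idle queue
print(deferred_gauge_latency_ms(3000, 5))  # ~3k dispatches ahead of the gauge
```

Under these assumptions the gauge is answered after 5 ms on an idle queue
and after 15005 ms with 3000 dispatches ahead of it, which would be enough
to trip a typical collector timeout. A value kept outside the mailbox (for
example, a counter the process updates and the metrics endpoint reads
directly) would not wait on the backlog at all.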

On Mon, Dec 19, 2016 at 9:23 PM, haosdent <haosdent@gmail.com> wrote:

> Hi, @zhitao
>
> > the `/metrics/snapshot` could take 10-30 seconds to respond.
>
> Do you mean that `/metrics/snapshot` returns a result after 10~30 seconds?
> Or that `/metrics/snapshot` takes 10~30 seconds to reflect a change in the
> `allocator/mesos/event_queue_dispatches` gauge?
>
> On Mon, Dec 19, 2016 at 1:11 PM, Zhitao Li <zhitaoli.cs@gmail.com> wrote:
>
> > Hi all,
> >
> > While I was debugging an allocator message queue build-up issue on master
> > (which I plan to share in another thread), I noticed that `/metrics/snapshot`
> > is also badly affected.
> >
> > For example, when the allocator queue has ~3k dispatches in it (revealed by
> > the allocator/mesos/event_queue_dispatches gauge), the `/metrics/snapshot`
> > could take 10-30 seconds to respond.
> >
> > During active debugging or outage fighting, this is pretty undesirable.
> >
> > My guess is that much of the stats collection code relies on *deferring* to
> > another libprocess process and collecting the result.
> >
> > Should we explore a more reliable way to track metrics independently from
> > libprocess's queue?
> >
> > --
> > Cheers,
> >
> > Zhitao Li
> >
>
>
>
> --
> Best Regards,
> Haosdent Huang
>
> --
> Zameer Manji
>
