ignite-dev mailing list archives

From Vladimir Ozerov <voze...@gridgain.com>
Subject Re: Facility to detect long STW pauses and other system response degradations
Date Tue, 28 Nov 2017 07:27:16 GMT
Makes sense.

On Wed, Nov 22, 2017 at 4:54 PM, Anton Vinogradov <avinogradov@gridgain.com>
wrote:

> Vova,
>
> Monitoring systems don't work the way you expect - they LOVE incremental
> metrics :)
>
> > 1) When did these events appear?
>
> Monitoring polls metrics every N seconds.
>
> 01.00 got 0, 0
> 02.00 got 1, 20 -> means between 01.00 and 02.00 there was 1 STW pause with
> duration 20
> 03.00 got 3, 100 -> means between 02.00 and 03.00 there were 2 STW pauses
> with total duration 80
>
> A good monitoring system will record these values and show you a graph when
> you decide to check this metric.
> So, at 03.01 you'll see "0, 1, 2" and "0, 20, 80".
>
> > 2) How is the duration distributed? Were there 10 pauses of 10 seconds
> > each, or 9 short pauses of 1 sec and one critical pause of 90s?
>
> So, the previous probe was 0, 0 and we got it a minute ago.
> Now we have 10, 100. It means there were 10 STW pauses with a total duration
> of 100 over the last minute.
>
> But in case we set the interval to 10 seconds, we'll get
> 1, 20
> 4, 20
> 0, 0
> 4, 10
> 1, 50
> 0, 0
>
> So, precision depends on the check interval, and it's up to the
> administration team to choose it.
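As a side note, the delta computation a monitoring system performs on such cumulative counters can be sketched in a few lines of Java (class and method names here are illustrative, not part of Ignite):

```java
// Sketch: turning cumulative STW metrics {count, totalDurationMs} sampled at
// fixed intervals into per-interval deltas, the way a monitoring system would.
public class StwDelta {
    /** Per-interval delta of two cumulative samples {stwCount, stwTotalMs}. */
    static long[] delta(long[] prev, long[] curr) {
        return new long[] { curr[0] - prev[0], curr[1] - prev[1] };
    }

    public static void main(String[] args) {
        long[] at0100 = {0, 0};
        long[] at0200 = {1, 20};
        long[] at0300 = {3, 100};

        long[] d1 = delta(at0100, at0200);
        long[] d2 = delta(at0200, at0300);

        // 1 pause / 20 ms in the first interval, 2 pauses / 80 ms in the second.
        System.out.println(d1[0] + " pause(s), " + d1[1] + " ms");
        System.out.println(d2[0] + " pause(s), " + d2[1] + " ms");
    }
}
```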
>
> > Maybe a sliding window plus min/max values would do a better job.
>
> That's the monitoring system's job (e.g. Zabbix).
>
>
> On Wed, Nov 22, 2017 at 4:10 PM, Vladimir Ozerov <vozerov@gridgain.com>
> wrote:
>
> > The question is how an administrator should interpret these numbers. OK, I
> > opened a JMX console and see that there were 10 long GC events, which took
> > 100 seconds in total.
> > 1) When did these events occur? Over the last day, which is more or less
> > OK, or over the last 10 minutes, meaning my server is nearly dead?
> > 2) How is the duration distributed? Were there 10 pauses of 10 seconds
> > each, or 9 short pauses of 1 sec and one critical pause of 90s?
> >
> > Maybe a sliding window plus min/max values would do a better job.
> >
> > On Wed, Nov 22, 2017 at 1:07 PM, Anton Vinogradov <
> > avinogradov@gridgain.com>
> > wrote:
> >
> > > Vova,
> > >
> > > 1) We can get collection info from GarbageCollectorMXBean,
> > > but it provides only
> > > - collectionCount
> > > - collectionTime
> > >
> > > These are very interesting metrics, but they tell us nothing about long
> > > STW pauses.
> > > A long STW pause means we have huge latency during the STW, and we
> > > should find the reason and fix it.
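For reference, this is a minimal sketch of reading those two metrics through the standard platform MBean API; note that both values are cumulative per collector, with no per-pause breakdown:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Reading the only GC metrics the standard JVM API exposes: cumulative
// collection count and accumulated collection time per collector. Nothing
// here reveals the duration of any individual stop-the-world pause.
public class GcMetrics {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                + ": collectionCount=" + gc.getCollectionCount()
                + ", collectionTime=" + gc.getCollectionTime() + " ms");
        }
    }
}
```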
> > >
> > > 2) So, we're working on new incremental metrics:
> > > - total count of STW pauses longer than XXX
> > > - total duration of STW pauses longer than XXX
> > >
> > > which show us the JVM/GC health.
> > >
> > > Does that answer your question?
> > >
> > > On Tue, Nov 21, 2017 at 9:05 PM, Vladimir Ozerov <vozerov@gridgain.com
> >
> > > wrote:
> > >
> > > > Anton,
> > > >
> > > > The question is why a user would need such a precise measurement. I
> > > > share Andrey’s opinion - I cannot understand the value.
> > > >
> > > > Tue, 21 Nov 2017, 19:33, Anton Vinogradov <avinogradov@gridgain.com>:
> > > >
> > > > > Andrey,
> > > > >
> > > > > > JVM provides sufficient means of detecting a struggling process
> > > > > > out of the box.
> > > > >
> > > > > Could you point to some articles describing how to detect an STW
> > > > > pause exceeding some duration using only the JVM API?
> > > > >
> > > > > On Tue, Nov 21, 2017 at 7:17 PM, Andrey Kornev <
> > > andrewkornev@hotmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > My 2 cents. Don’t do it. JVM provides sufficient means of
> > detecting a
> > > > > > struggling process out of the box. SRE/Operations teams usually
> > know
> > > > how
> > > > > to
> > > > > > monitor JVMs and can handle killing of such processes themselves.
> > > > > >
> > > > > > The feature adds no value, just complexity (and more
> configuration
> > > > > > parameters (!) — as if Ignite didn’t have enough of them
> already).
> > > > > >
> > > > > > Regards,
> > > > > > Andrey
> > > > > > _____________________________
> > > > > > From: Denis Magda <dmagda@apache.org>
> > > > > > Sent: Monday, November 20, 2017 3:10 PM
> > > > > > Subject: Re: Facility to detect long STW pauses and other system
> > > > response
> > > > > > degradations
> > > > > > To: <dev@ignite.apache.org>
> > > > > >
> > > > > >
> > > > > > My 2 cents.
> > > > > >
> > > > > > 1. Totally for a separate native process that will handle the
> > > > > > monitoring of an Ignite process. The watchdog process can simply
> > > > > > start a JVM tool like jstat and parse its GC logs:
> > > > > > https://dzone.com/articles/how-monitor-java-garbage
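To make the GC-log route concrete, here is a hypothetical sketch of the parsing side of such a watchdog process, extracting the pause duration from a `-XX:+PrintGCApplicationStoppedTime` line (class and method names are illustrative; the regex accepts both `.` and `,` decimal separators since GC log output is locale-dependent):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts the stop-the-world pause duration from a HotSpot
// "Total time for which application threads were stopped" log line.
public class StoppedTimeParser {
    private static final Pattern STOPPED = Pattern.compile(
        "Total time for which application threads were stopped: (\\d+[.,]\\d+) seconds");

    /** Returns the pause in milliseconds, or -1 if the line does not match. */
    static long pauseMillis(String line) {
        Matcher m = STOPPED.matcher(line);
        if (!m.find())
            return -1;
        return Math.round(Double.parseDouble(m.group(1).replace(',', '.')) * 1000);
    }

    public static void main(String[] args) {
        String line = "2017-11-20T17:33:34.521+0300: 7,030: Total time for which"
            + " application threads were stopped: 5,5856603 seconds,"
            + " Stopping threads took: 0,0000229 seconds";
        System.out.println(pauseMillis(line)); // prints 5586
    }
}
```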
> > > > > >
> > > > > > 2. As for the STW handling, I would make the possible reaction
> > > > > > more generic. Let's define a policy (an enumeration) that will
> > > > > > define how to deal with an unstable node. The actions might be as
> > > > > > follows: kill the node, restart the node, or trigger a custom
> > > > > > script using Runtime.exec or other methods.
> > > > > >
> > > > > > What do you think? Specifically on point 2.
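The proposed policy could look roughly like the following sketch. All names here are hypothetical illustrations of the idea, not Ignite's actual API:

```java
// Hypothetical policy enumeration for reacting to an unstable node, as
// proposed in the thread: kill, restart, or run a custom script.
public enum UnstableNodeAction {
    KILL_NODE,
    RESTART_NODE,
    CUSTOM_SCRIPT;

    /** Applies the chosen reaction; scriptPath is used only by CUSTOM_SCRIPT. */
    public void apply(String scriptPath) throws java.io.IOException {
        switch (this) {
            case KILL_NODE:
                Runtime.getRuntime().halt(130); // immediate exit, skips shutdown hooks
                break;
            case RESTART_NODE:
                System.exit(130); // rely on an external supervisor to restart the JVM
                break;
            case CUSTOM_SCRIPT:
                Runtime.getRuntime().exec(new String[] { scriptPath });
                break;
        }
    }

    public static void main(String[] args) {
        for (UnstableNodeAction a : values())
            System.out.println(a);
    }
}
```

The restart case assumes an external supervisor (systemd, a wrapper script, etc.) notices the exit and restarts the process, since a JVM frozen in a long STW pause cannot reliably restart itself.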
> > > > > >
> > > > > > —
> > > > > > Denis
> > > > > >
> > > > > > > On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <
> > > > > avinogradov@gridgain.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > Yakov,
> > > > > > >
> > > > > > > Issue is https://issues.apache.org/jira/browse/IGNITE-6171
> > > > > > >
> > > > > > > We split the issue into
> > > > > > > #1 STW duration metrics
> > > > > > > #2 External monitoring that allows stopping a node during STW
> > > > > > >
> > > > > > >> Testing a GC pause with a java thread is
> > > > > > >> a bit strange and can give info only after the GC pause finishes.
> > > > > > >
> > > > > > > That's OK, since it's #1.
> > > > > > >
> > > > > > > On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <
> > > > > > sbt.sorokin.dvl@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> I have tested the solution with a java-thread, and the GC logs
> > > > > > >> contained the same thread-stopping pause values as were detected
> > > > > > >> by the java-thread.
> > > > > > >>
> > > > > > >>
> > > > > > >> My log (contains pauses > 100ms):
> > > > > > >> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long STW pause: 507 milliseconds.
> > > > > > >> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long STW pause: 5595 milliseconds.
> > > > > > >> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long STW pause: 3262 milliseconds.
> > > > > > >> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long STW pause: 1737 milliseconds.
> > > > > > >> GC log:
> > > > > > >> gridgain@dell-5580-92zc8h2:~$ cat ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
> > > > > > >> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which application threads were stopped: 0,0000845 seconds, Stopping threads took: 0,0000246 seconds
> > > > > > >> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which application threads were stopped: 0,0001072 seconds, Stopping threads took: 0,0000252 seconds
> > > > > > >> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which application threads were stopped: 0,5001082 seconds, Stopping threads took: 0,0000178 seconds // GOT!
> > > > > > >> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which application threads were stopped: 5,5856603 seconds, Stopping threads took: 0,0000229 seconds // GOT!
> > > > > > >> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which application threads were stopped: 3,2595700 seconds, Stopping threads took: 0,0000223 seconds // GOT!
> > > > > > >> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which application threads were stopped: 1,7337123 seconds, Stopping threads took: 0,0000121 seconds // GOT!
> > > > > > >>
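The "java-thread" detector tested above can be sketched as follows. A daemon thread sleeps for a short fixed interval; during a stop-the-world pause that thread is frozen too, so a large overshoot of the expected wake-up time indicates a possible pause. The interval, threshold, and metric names below are illustrative, not Ignite's actual configuration:

```java
// Watchdog-thread STW detector: measures sleep overshoot and accumulates
// the incremental metrics discussed in the thread (count and total duration
// of pauses longer than a threshold).
public class StwWatchdog {
    static final long INTERVAL_MS = 10;
    static final long THRESHOLD_MS = 100;

    static volatile long stwCount;
    static volatile long stwTotalMs;

    /** Overshoot of the actual elapsed time over the expected interval, in ms. */
    static long overshootMs(long prevNanos, long nowNanos) {
        return (nowNanos - prevNanos) / 1_000_000 - INTERVAL_MS;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread watchdog = new Thread(() -> {
            long prev = System.nanoTime();
            while (true) {
                try {
                    Thread.sleep(INTERVAL_MS);
                }
                catch (InterruptedException e) {
                    return;
                }
                long now = System.nanoTime();
                long pauseMs = overshootMs(prev, now);
                if (pauseMs >= THRESHOLD_MS) {
                    stwCount++;
                    stwTotalMs += pauseMs;
                    System.out.println("Possible too long STW pause: " + pauseMs + " milliseconds.");
                }
                prev = now;
            }
        }, "stw-watchdog");
        watchdog.setDaemon(true);
        watchdog.start();
        Thread.sleep(200); // let the watchdog run briefly, then exit
    }
}
```

Note that, as Yakov pointed out, this only reports a pause after it has finished, which is acceptable for the metrics use case (#1) but not for killing a node mid-pause (#2).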
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
> > > > > > >>
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
