ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vladimir Ozerov <voze...@gridgain.com>
Subject Re: Facility to detect long STW pauses and other system response degradations
Date Wed, 22 Nov 2017 13:10:00 GMT
The question is how administrator should interpret these numbers. Ok, I
opened JMX console and see that there was 10 long GC events, which took 100
seconds.
1) When did these events appear? Over the last day, which is more or less
OK, or over the last 10 minutes, so my server is nearly dead?
2) How duration is distributed? Was 10 pauses 10 seconds each, or 9 short
pauses of 1 sec and one critical pause of 90s?

May be a kind of sliding window plus min/max values will do better job.

On Wed, Nov 22, 2017 at 1:07 PM, Anton Vinogradov <avinogradov@gridgain.com>
wrote:

> Vova,
>
> 1) We can gain collections info from GarbageCollectorMXBean
> But it provides only
> - collectionCount
> - collectionTime.
>
> This is very interesting metrics, but they tell us nothing about long STW.
> Long STW means we have huge latency during STW and we should find the
> reason and solve it.
>
> 2) So, we're working on new incremental metrics
> - total amount of STW longer than XXX
> - total duration of STW longer than XXX
>
> which shows us JVM/GC health situation.
>
> Is it answer to your question?
>
> On Tue, Nov 21, 2017 at 9:05 PM, Vladimir Ozerov <vozerov@gridgain.com>
> wrote:
>
> > Anton,
> >
> > The question is why user may need so precise measurement? I share
> Andrey’s
> > opinion - cannot understand the value.
> >
> > вт, 21 нояб. 2017 г. в 19:33, Anton Vinogradov <avinogradov@gridgain.com
> >:
> >
> > > Andrey,
> > >
> > > >  JVM provides sufficient means of detecting a struggling process out
> of
> > > the box.
> > >
> > > Could you point to some articles describing how to detect STW exceeding
> > > some duration using only JVM API?
> > >
> > > On Tue, Nov 21, 2017 at 7:17 PM, Andrey Kornev <
> andrewkornev@hotmail.com
> > >
> > > wrote:
> > >
> > > > My 2 cents. Don’t do it. JVM provides sufficient means of detecting
a
> > > > struggling process out of the box. SRE/Operations teams usually know
> > how
> > > to
> > > > monitor JVMs and can handle killing of such processes themselves.
> > > >
> > > > The feature adds no value, just complexity (and more configuration
> > > > parameters (!) — as if Ignite didn’t have enough of them already).
> > > >
> > > > Regards,
> > > > Andrey
> > > > _____________________________
> > > > From: Denis Magda <dmagda@apache.org>
> > > > Sent: Monday, November 20, 2017 3:10 PM
> > > > Subject: Re: Facility to detect long STW pauses and other system
> > response
> > > > degradations
> > > > To: <dev@ignite.apache.org>
> > > >
> > > >
> > > > My 2 cents.
> > > >
> > > > 1. Totally for a separate native process that will handle the
> > monitoring
> > > > of an Ignite process. The watchdog process can simply start a JVM
> tool
> > > like
> > > > jstat and parse its GC logs: https://dzone.com/articles/
> > > > how-monitor-java-garbage <https://dzone.com/articles/
> > > > how-monitor-java-garbage>
> > > >
> > > > 2. As for the STW handling, I would make a possible reaction more
> > > generic.
> > > > Let’s define a policy (enumeration) that will define how to deal with
> > an
> > > > unstable node. The events might be as follows - kill a node, restart
> a
> > > > node, trigger a custom script using Runtime.exec or other methods.
> > > >
> > > > What’d you think? Specifically on point 2.
> > > >
> > > > —
> > > > Denis
> > > >
> > > > > On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <
> > > avinogradov@gridgain.com>
> > > > wrote:
> > > > >
> > > > > Yakov,
> > > > >
> > > > > Issue is https://issues.apache.org/jira/browse/IGNITE-6171
> > > > >
> > > > > We split issue to
> > > > > #1 STW duration metrics
> > > > > #2 External monitoring allows to stop node during STW
> > > > >
> > > > >> Testing GC pause with java thread is
> > > > >> a bit strange and can give info only after GC pause finishes.
> > > > >
> > > > > That's ok since it's #1
> > > > >
> > > > > On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <
> > > > sbt.sorokin.dvl@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> I have tested solution with java-thread and GC logs had contain
> same
> > > > pause
> > > > >> values of thread stopping which was detected by java-thread.
> > > > >>
> > > > >>
> > > > >> My log (contains pauses > 100ms):
> > > > >> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too
long
> > STW
> > > > >> pause: 507 milliseconds.
> > > > >> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too
long
> > STW
> > > > >> pause: 5595 milliseconds.
> > > > >> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too
long
> > STW
> > > > >> pause: 3262 milliseconds.
> > > > >> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too
long
> > STW
> > > > >> pause: 1737 milliseconds.
> > > > >>
> > > > >> GC log:
> > > > >> gridgain@dell-5580-92zc8h2:~$ cat
> > > > >> ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
> > > > >> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which
> > application
> > > > >> threads were stopped: 0,0000845 seconds, Stopping threads took:
> > > > 0,0000246
> > > > >> seconds
> > > > >> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which
> > application
> > > > >> threads were stopped: 0,0001072 seconds, Stopping threads took:
> > > > 0,0000252
> > > > >> seconds
> > > > >> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which
> > application
> > > > >> threads were stopped: 0,5001082 seconds, Stopping threads took:
> > > > 0,0000178
> > > > >> seconds // GOT!
> > > > >> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which
> > application
> > > > >> threads were stopped: 5,5856603 seconds, Stopping threads took:
> > > > 0,0000229
> > > > >> seconds // GOT!
> > > > >> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which
> > application
> > > > >> threads were stopped: 3,2595700 seconds, Stopping threads took:
> > > > 0,0000223
> > > > >> seconds // GOT!
> > > > >> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which
> > application
> > > > >> threads were stopped: 1,7337123 seconds, Stopping threads took:
> > > > 0,0000121
> > > > >> seconds // GOT!
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
> > > > >>
> > > >
> > > >
> > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message