ignite-dev mailing list archives

From Anton Vinogradov <avinogra...@gridgain.com>
Subject Re: Facility to detect long STW pauses and other system response degradations
Date Tue, 21 Nov 2017 10:06:31 GMT
Denis,

> 1. Totally for a separate native process that will handle the monitoring
> of an Ignite process. The watchdog process can simply start a JVM tool like
> jstat and parse its GC logs:
> https://dzone.com/articles/how-monitor-java-garbage
Different GCs, and even the same GC on different OS/JVM combinations,
produce different logs, so they are not easy to parse. But since
http://gceasy.io can do it, it looks possible, somehow :) .
Do you know of any libraries or solutions that can do this in real time?
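
As a starting point for the parsing question, here is a minimal sketch
that handles only the one line format quoted further down in this thread
("Total time for which application threads were stopped"), which JDK 8
HotSpot prints when -XX:+PrintGCApplicationStoppedTime is set. The class
and method names are hypothetical and not part of Ignite, and other
collectors or JVM versions will need other patterns.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not an Ignite class): reads safepoint pauses from GC log lines. */
final class StoppedTimeParser {
    // Assumption: JDK 8 style "stopped time" lines, as in the log quoted in this thread.
    // The decimal separator can be '.' or ',' depending on the JVM locale.
    private static final Pattern STOPPED = Pattern.compile(
        "Total time for which application threads were stopped: (\\d+)[.,](\\d+) seconds");

    private StoppedTimeParser() {
    }

    /** Returns the pause in milliseconds, or empty if this is not a stopped-time line. */
    static OptionalLong pauseMillis(String line) {
        Matcher m = STOPPED.matcher(line);

        if (!m.find())
            return OptionalLong.empty();

        // Normalize the separator before parsing so "0,5001082" is read as 0.5001082.
        double seconds = Double.parseDouble(m.group(1) + "." + m.group(2));

        return OptionalLong.of(Math.round(seconds * 1000));
    }
}
```

A watchdog could feed each new line of the GC log to pauseMillis() and
compare the result with a threshold. For the line with "0,5001082 seconds"
the method returns 500.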

> 2. As for the STW handling, I would make a possible reaction more
> generic. Let’s define a policy (enumeration) that will define how to deal
> with an unstable node. The events might be as follows - kill a node,
> restart a node, trigger a custom script using Runtime.exec or other methods.
Yes, it should be similar to the segmentation policy plus custom script
execution.
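
To make the policy idea concrete, here is a rough sketch of such an
enumeration with a dispatcher. All names here (UnstableNodePolicy,
NodeControl, UnstableNodeReaction) are made up for illustration and do not
exist in Ignite. The script case uses ProcessBuilder, which is one of the
"other methods" next to Runtime.exec.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

/** Hypothetical policy values; names are illustrative only. */
enum UnstableNodePolicy { KILL, RESTART, RUN_SCRIPT, NOOP }

/** Hypothetical abstraction over how the watchdog stops or restarts the node. */
interface NodeControl {
    void kill();

    void restart();
}

/** Applies the configured policy when the node is reported as unstable. */
final class UnstableNodeReaction {
    private final UnstableNodePolicy policy;
    private final NodeControl node;
    private final List<String> scriptCmd; // Command line, used only for RUN_SCRIPT.

    UnstableNodeReaction(UnstableNodePolicy policy, NodeControl node, List<String> scriptCmd) {
        this.policy = policy;
        this.node = node;
        this.scriptCmd = scriptCmd;
    }

    void onUnstable() {
        switch (policy) {
            case KILL:
                node.kill();
                break;

            case RESTART:
                node.restart();
                break;

            case RUN_SCRIPT:
                try {
                    // Starts the script and returns without waiting for it to finish.
                    new ProcessBuilder(scriptCmd).inheritIO().start();
                }
                catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                break;

            default:
                break; // NOOP: nothing to do.
        }
    }
}
```

Keeping the kill and restart actions behind an interface lets the same
policy be used from a separate watchdog process or from inside the node.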


On Tue, Nov 21, 2017 at 2:10 AM, Denis Magda <dmagda@apache.org> wrote:

> My 2 cents.
>
> 1. Totally for a separate native process that will handle the monitoring
> of an Ignite process. The watchdog process can simply start a JVM tool like
> jstat and parse its GC logs:
> https://dzone.com/articles/how-monitor-java-garbage
>
> 2. As for the STW handling, I would make a possible reaction more generic.
> Let’s define a policy (enumeration) that will define how to deal with an
> unstable node. The events might be as follows - kill a node, restart a
> node, trigger a custom script using Runtime.exec or other methods.
>
> What do you think? Specifically on point 2.
>
> —
> Denis
>
> > On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <avinogradov@gridgain.com> wrote:
> >
> > Yakov,
> >
> > Issue is https://issues.apache.org/jira/browse/IGNITE-6171
> >
> > We split the issue into:
> > #1 STW duration metrics
> > #2 External monitoring that can stop a node during an STW pause
> >
> >> Testing GC pauses with a Java thread is
> >> a bit strange and can give info only after the GC pause finishes.
> >
> > That's ok since it's #1
> >
> > On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <sbt.sorokin.dvl@gmail.com> wrote:
> >
> >> I have tested the solution with a Java thread, and the GC logs contain
> >> the same pause values as those detected by the Java thread.
> >>
> >>
> >> My log (contains pauses > 100ms):
> >> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long STW pause: 507 milliseconds.
> >> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long STW pause: 5595 milliseconds.
> >> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long STW pause: 3262 milliseconds.
> >> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long STW pause: 1737 milliseconds.
> >>
> >> GC log:
> >> gridgain@dell-5580-92zc8h2:~$ cat ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
> >> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which application threads were stopped: 0,0000845 seconds, Stopping threads took: 0,0000246 seconds
> >> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which application threads were stopped: 0,0001072 seconds, Stopping threads took: 0,0000252 seconds
> >> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which application threads were stopped: 0,5001082 seconds, Stopping threads took: 0,0000178 seconds    // GOT!
> >> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which application threads were stopped: 5,5856603 seconds, Stopping threads took: 0,0000229 seconds    // GOT!
> >> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which application threads were stopped: 3,2595700 seconds, Stopping threads took: 0,0000223 seconds    // GOT!
> >> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which application threads were stopped: 1,7337123 seconds, Stopping threads took: 0,0000121 seconds    // GOT!
> >>
> >>
>
>
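
The Java-thread check discussed in the quoted messages is not shown in
the thread, so the following is only a guess at its general shape: a
thread sleeps for a short interval and treats any large overshoot as a
possible STW pause. PauseDetector and its parameters are hypothetical
names. As noted above, this can report a pause only after it has ended,
and scheduler delays also count as overshoot, hence "possible".

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

/** Hypothetical sketch of a sleep-overshoot pause detector (not the code tested in the thread). */
final class PauseDetector implements Runnable {
    private final long sleepMillis;
    private final long thresholdMillis;
    private final LongConsumer onPause; // Receives the detected pause in milliseconds.

    PauseDetector(long sleepMillis, long thresholdMillis, LongConsumer onPause) {
        this.sleepMillis = sleepMillis;
        this.thresholdMillis = thresholdMillis;
        this.onPause = onPause;
    }

    /** How much longer than requested the sleep took, in milliseconds; never negative. */
    static long overshootMillis(long startNanos, long endNanos, long sleepMillis) {
        long elapsed = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);

        return Math.max(0, elapsed - sleepMillis);
    }

    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();

            try {
                Thread.sleep(sleepMillis);
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // Preserve the flag and stop.

                return;
            }

            long pause = overshootMillis(start, System.nanoTime(), sleepMillis);

            if (pause > thresholdMillis)
                onPause.accept(pause);
        }
    }
}
```

It could be started as a daemon thread, for example with a 50 ms sleep, a
100 ms threshold, and a callback that logs "Possible too long STW pause".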
