ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrey Kornev <andrewkor...@hotmail.com>
Subject Re: Facility to detect long STW pauses and other system response degradations
Date Tue, 21 Nov 2017 16:17:11 GMT
My 2 cents. Don’t do it. JVM provides sufficient means of detecting a struggling process
out of the box. SRE/Operations teams usually know how to monitor JVMs and can handle killing
of such processes themselves.

The feature adds no value, just complexity (and more configuration parameters (!) — as if
Ignite didn’t have enough of them already).

Regards,
Andrey
_____________________________
From: Denis Magda <dmagda@apache.org>
Sent: Monday, November 20, 2017 3:10 PM
Subject: Re: Facility to detect long STW pauses and other system response degradations
To: <dev@ignite.apache.org>


My 2 cents.

1. Totally for a separate native process that will handle the monitoring of an Ignite process.
The watchdog process can simply start a JVM tool like jstat and parse its GC logs: https://dzone.com/articles/how-monitor-java-garbage
<https://dzone.com/articles/how-monitor-java-garbage>

2. As for the STW handling, I would make a possible reaction more generic. Let’s define
a policy (enumeration) that will define how to deal with an unstable node. The events might
be as follows - kill a node, restart a node, trigger a custom script using Runtime.exec or
other methods.

What’d you think? Specifically on point 2.

—
Denis

> On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <avinogradov@gridgain.com> wrote:
>
> Yakov,
>
> Issue is https://issues.apache.org/jira/browse/IGNITE-6171
>
> We split issue to
> #1 STW duration metrics
> #2 External monitoring allows to stop node during STW
>
>> Testing GC pause with java thread is
>> a bit strange and can give info only after GC pause finishes.
>
> That's ok since it's #1
>
> On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <sbt.sorokin.dvl@gmail.com>
> wrote:
>
>> I have tested solution with java-thread and GC logs had contain same pause
>> values of thread stopping which was detected by java-thread.
>>
>>
>> My log (contains pauses > 100ms):
>> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long STW
>> pause: 507 milliseconds.
>> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long STW
>> pause: 5595 milliseconds.
>> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long STW
>> pause: 3262 milliseconds.
>> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long STW
>> pause: 1737 milliseconds.
>>
>> GC log:
>> gridgain@dell-5580-92zc8h2:~$ cat
>> ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
>> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which application
>> threads were stopped: 0,0000845 seconds, Stopping threads took: 0,0000246
>> seconds
>> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which application
>> threads were stopped: 0,0001072 seconds, Stopping threads took: 0,0000252
>> seconds
>> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which application
>> threads were stopped: 0,5001082 seconds, Stopping threads took: 0,0000178
>> seconds // GOT!
>> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which application
>> threads were stopped: 5,5856603 seconds, Stopping threads took: 0,0000229
>> seconds // GOT!
>> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which application
>> threads were stopped: 3,2595700 seconds, Stopping threads took: 0,0000223
>> seconds // GOT!
>> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which application
>> threads were stopped: 1,7337123 seconds, Stopping threads took: 0,0000121
>> seconds // GOT!
>>
>>
>>
>>
>> --
>> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
>>




Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message