ignite-dev mailing list archives

From Denis Magda <dma...@apache.org>
Subject Re: Ignite Enhancement Proposal #7 (Internal problems detection)
Date Wed, 22 Nov 2017 21:43:10 GMT
Just provide FailureProcessingPolicy with possible reactions:
- NOOP - exceptions will be reported, metrics will be triggered but an affected Ignite process
won’t be touched.
- HALT (or STOP or KILL) - all the actions of NOOP + Ignite process termination.
- RESTART - NOOP actions + process restart.
- EXEC - execute a custom script provided by the user.

If needed, the policy can be set per known failure type, such as OOM or persistence errors, so that the
user can act accordingly based on the context.
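
A minimal sketch of how such a policy could look, assuming a default reaction plus optional per-failure overrides. Only the reaction names come from the list above; the FailureType enum, the on(...) and reactionFor(...) methods, and the class layout are hypothetical and not an existing Ignite API. The sketch only selects a reaction; it does not terminate or restart anything.

```java
import java.util.EnumMap;
import java.util.Map;

/** Reactions named in the proposal above. */
enum FailureReaction { NOOP, HALT, RESTART, EXEC }

/** Hypothetical failure categories, loosely based on the examples in this thread. */
enum FailureType { OUT_OF_MEMORY, PERSISTENCE_ERROR, EXCHANGE_WORKER_FAILURE }

/** Hypothetical holder for a default reaction plus per-failure overrides. */
final class FailureProcessingPolicy {
    private final FailureReaction defaultReaction;

    /** Overrides keyed by failure type; absent keys fall back to the default. */
    private final Map<FailureType, FailureReaction> overrides =
        new EnumMap<>(FailureType.class);

    FailureProcessingPolicy(FailureReaction defaultReaction) {
        this.defaultReaction = defaultReaction;
    }

    /** Registers a reaction for one failure type and returns this for chaining. */
    FailureProcessingPolicy on(FailureType type, FailureReaction reaction) {
        overrides.put(type, reaction);
        return this;
    }

    /** Returns the override for the given type, or the default reaction. */
    FailureReaction reactionFor(FailureType type) {
        return overrides.getOrDefault(type, defaultReaction);
    }
}

/** Small demo: report-only by default, halt only on persistence errors. */
final class FailurePolicyDemo {
    public static void main(String[] args) {
        FailureProcessingPolicy policy =
            new FailureProcessingPolicy(FailureReaction.NOOP)
                .on(FailureType.PERSISTENCE_ERROR, FailureReaction.HALT);

        for (FailureType type : FailureType.values())
            System.out.println(type + " -> " + policy.reactionFor(type));
    }
}
```

With a NOOP default, a failure type such as OOM stays report-only unless the user explicitly opts into a stronger reaction for it.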

—
Denis

> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <vozerov@gridgain.com> wrote:
> 
> In the first iteration I would focus only on reporting facilities, to let
> administrators spot dangerous situations. In the second phase, when all
> reporting and metrics are ready, we can think about some automatic actions.
> 
> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <mcherkasov@gridgain.com> wrote:
> 
>> Hi Anton,
>> 
>> I don't think that we should shut down a node in case of IgniteOOMException.
>> If one node has no space, then the others probably don't have it either, so
>> rebalancing will cause IgniteOOM on all other nodes and will kill the whole
>> cluster. I think for some configurations the cluster should survive and allow
>> the user to clean the cache and/or add more nodes.
>> 
>> Thanks,
>> Mikhail.
>> 
>> On Nov 20, 2017, at 6:53 PM, "Anton Vinogradov" <avinogradov@gridgain.com> wrote:
>> 
>>> Igniters,
>>> 
>>> Internal problems may, and unfortunately do, cause unexpected cluster
>>> behavior.
>>> We should define the behavior in case any internal problem happens.
>>> 
>>> Well-known internal problems can be split into:
>>> 1) OOM or any other reason that causes a node crash
>>> 
>>> 2) Situations requiring graceful node shutdown with custom notification
>>> - IgniteOutOfMemoryException
>>> - Persistence errors
>>> - ExchangeWorker exits with error
>>> 
>>> 3) Performance issues that should be covered by metrics
>>> - GC STW duration
>>> - Timed out tasks and jobs
>>> - TX deadlock
>>> - Hung Tx (waits for some service)
>>> - Java Deadlocks
>>> 
>>> I created a special issue [1] to make sure all these metrics will be
>>> presented in WebConsole or VisorConsole (which is preferred?)
>>> 
>>> 4) Situations requiring an external monitoring implementation
>>> - GC STW duration exceeds the maximum allowed length (the node should be stopped
>>> before the STW pause finishes)
>>> 
>>> All these problems were reported by different people at different times,
>>> so we should reanalyze each of them and, possibly, find better ways to
>>> solve them than those described in the issues.
>>> 
>>> P.S. IEP-7 [2] already contains 9 issues, feel free to mention something
>>> else :)
>>> 
>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
>>> 
>> 

