ignite-dev mailing list archives

From Denis Magda <dma...@apache.org>
Subject Re: Ignite Enhancement Proposal #7 (Internal problems detection)
Date Thu, 23 Nov 2017 15:26:46 GMT
No objections here. Additional policies like EXEC might be added later depending on user needs.
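
As a rough illustration of the first three policies quoted below (NOOP, HALT, RESTART), including the per-failure override, here is a minimal Java sketch. Only the name FailureProcessingPolicy and the policy names come from this thread; FailureKind, FailureProcessor, and the recorded action strings are hypothetical and are not Ignite API. Actions are recorded as strings rather than executed, so the sketch runs without touching a real process.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Reactions named after the policies discussed in this thread (EXEC left out). */
enum FailureProcessingPolicy { NOOP, HALT, RESTART }

/** Hypothetical failure kinds, loosely based on the examples in the thread. */
enum FailureKind { OUT_OF_MEMORY, PERSISTENCE_ERROR, EXCHANGE_WORKER_ERROR }

/** Hypothetical dispatcher: always reports, then applies the configured policy. */
final class FailureProcessor {
    private final FailureProcessingPolicy defaultPolicy;
    private final Map<FailureKind, FailureProcessingPolicy> overrides =
        new EnumMap<>(FailureKind.class);

    FailureProcessor(FailureProcessingPolicy defaultPolicy) {
        this.defaultPolicy = defaultPolicy;
    }

    /** Sets a policy for one known failure kind. */
    FailureProcessor override(FailureKind kind, FailureProcessingPolicy policy) {
        overrides.put(kind, policy);
        return this;
    }

    /** Returns the actions that would be taken, in order, for the given failure. */
    List<String> onFailure(FailureKind kind) {
        List<String> actions = new ArrayList<>();
        actions.add("REPORT"); // every policy includes the NOOP actions
        switch (overrides.getOrDefault(kind, defaultPolicy)) {
            case HALT:
                actions.add("STOP_NODE");
                break;
            case RESTART:
                actions.add("RESTART_NODE");
                break;
            default:
                break; // NOOP: report only, leave the process untouched
        }
        return actions;
    }
}

final class FailurePolicySketch {
    public static void main(String[] args) {
        // Default is HALT, but an OOM only reports, so the cluster keeps running.
        FailureProcessor p = new FailureProcessor(FailureProcessingPolicy.HALT)
            .override(FailureKind.OUT_OF_MEMORY, FailureProcessingPolicy.NOOP);
        System.out.println(p.onFailure(FailureKind.OUT_OF_MEMORY));
        System.out.println(p.onFailure(FailureKind.PERSISTENCE_ERROR));
    }
}
```

The OOM override in main reflects the concern raised further down the thread that stopping a node on IgniteOOMException could cascade across the cluster.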

—
Denis

> On Nov 23, 2017, at 2:26 AM, Дмитрий Сорокин <sbt.sorokin.dvl@gmail.com> wrote:
> 
> Denis,
> I propose starting with the first three policies (they are already implemented
> and only await some code cleanup, commit, and review).
> As for the fourth policy (EXEC), I think it is an additional property
> (some script path) rather than a policy.
> 
> 2017-11-23 0:43 GMT+03:00 Denis Magda <dmagda@apache.org>:
> 
>> Just provide FailureProcessingPolicy with possible reactions:
>> - NOOP - exceptions will be reported, metrics will be triggered but an
>> affected Ignite process won’t be touched.
>> - HALT (or STOP or KILL) - all the actions of NOOP + Ignite
>> process termination.
>> - RESTART - NOOP actions + process restart.
>> - EXEC - execute a custom script provided by the user.
>> 
>> If needed, the policy can be set per known failure, such as OOM or persistence
>> errors, so that the user can act accordingly based on the context.
>> 
>> —
>> Denis
>> 
>>> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <vozerov@gridgain.com> wrote:
>>> 
>>> In the first iteration I would focus only on reporting facilities, to let
>>> administrators spot dangerous situations. In the second phase, when all
>>> reporting and metrics are ready, we can think about some automatic actions.
>>> 
>>> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <mcherkasov@gridgain.com> wrote:
>>> 
>>>> Hi Anton,
>>>> 
>>>> I don't think we should shut down a node in case of IgniteOOMException.
>>>> If one node has no space, then the others probably don't have it either, so
>>>> rebalancing will cause IgniteOOM on all other nodes and kill the whole
>>>> cluster. I think that for some configurations the cluster should survive and
>>>> allow the user to clean the cache and/or add more nodes.
>>>> 
>>>> Thanks,
>>>> Mikhail.
>>>> 
>>>> On Nov 20, 2017, at 6:53 PM, "Anton Vinogradov" <avinogradov@gridgain.com> wrote:
>>>> 
>>>>> Igniters,
>>>>> 
>>>>> Internal problems may, and unfortunately do, cause unexpected cluster
>>>>> behavior.
>>>>> We should define the behavior for the case when any internal problem happens.
>>>>> 
>>>>> Well-known internal problems can be split into:
>>>>> 1) OOM or any other cause of a node crash
>>>>> 
>>>>> 2) Situations requiring graceful node shutdown with custom notification
>>>>> - IgniteOutOfMemoryException
>>>>> - Persistence errors
>>>>> - ExchangeWorker exits with error
>>>>> 
>>>>> 3) Performance issues that should be covered by metrics
>>>>> - GC STW duration
>>>>> - Timed out tasks and jobs
>>>>> - TX deadlock
>>>>> - Hung TX (waiting for some service)
>>>>> - Java deadlocks
>>>>> 
>>>>> I created a special issue [1] to make sure all these metrics are
>>>>> presented in WebConsole or VisorConsole (which is preferred?)
>>>>> 
>>>>> 4) Situations requiring an external monitoring implementation
>>>>> - GC STW duration exceeds the maximum allowed length (the node should be
>>>>> stopped before the STW pause finishes)
>>>>> 
>>>>> All these problems were reported by different people at different times,
>>>>> so we should reanalyze each of them and, possibly, find better ways to
>>>>> solve them than those described in the issues.
>>>>> 
>>>>> P.S. IEP-7 [2] already contains 9 issues; feel free to mention something
>>>>> else :)
>>>>> 
>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
>>>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
>>>>> 
>>>> 
>> 
>> 

