ignite-dev mailing list archives

From Vladimir Ozerov <voze...@gridgain.com>
Subject Re: Ignite Enhancement Proposal #7 (Internal problems detection)
Date Wed, 29 Nov 2017 07:35:27 GMT
Denis,

Yes, but can we look at the proposed API before we dig into the implementation?

On Tue, Nov 28, 2017 at 9:43 PM, Denis Magda <dmagda@apache.org> wrote:

> I think the failure processing policy should be configured via
> IgniteConfiguration in a way similar to the segmentation policies.
>
> —
> Denis
>
> > On Nov 27, 2017, at 11:28 PM, Vladimir Ozerov <vozerov@gridgain.com> wrote:
> >
> > Dmitry,
> >
> > How will these policies be configured? Do you have any API in mind?
> >
> > On Thu, Nov 23, 2017 at 6:26 PM, Denis Magda <dmagda@apache.org> wrote:
> >
> >> No objections here. Additional policies like EXEC might be added later
> >> depending on user needs.
> >>
> >> —
> >> Denis
> >>
> >>> On Nov 23, 2017, at 2:26 AM, Дмитрий Сорокин <sbt.sorokin.dvl@gmail.com> wrote:
> >>>
> >>> Denis,
> >>> I propose starting with the first three policies (they are already
> >>> implemented and just await some code cleanup, commit, and review).
> >>> As for the fourth policy (EXEC), I think it is more of an additional
> >>> property (some script path) than a policy.
> >>>
> >>> 2017-11-23 0:43 GMT+03:00 Denis Magda <dmagda@apache.org>:
> >>>
> >>>> Just provide FailureProcessingPolicy with possible reactions:
> >>>> - NOOP - exceptions will be reported and metrics will be triggered, but
> >>>> the affected Ignite process won't be touched.
> >>>> - HALT (or STOP or KILL) - all the actions of NOOP + Ignite
> >>>> process termination.
> >>>> - RESTART - NOOP actions + process restart.
> >>>> - EXEC - execute a custom script provided by the user.
> >>>>
> >>>> If needed, the policy can be set per known failure, such as OOM or
> >>>> persistence errors, so that the user can act based on the context.
> >>>>
> >>>> —
> >>>> Denis
> >>>>
> >>>>> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <vozerov@gridgain.com> wrote:
> >>>>>
> >>>>> In the first iteration I would focus only on reporting facilities, to
> >>>>> let the administrator spot dangerous situations. In the second phase,
> >>>>> when all reporting and metrics are ready, we can think about some
> >>>>> automatic actions.
> >>>>>
> >>>>> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <mcherkasov@gridgain.com> wrote:
> >>>>>
> >>>>>> Hi Anton,
> >>>>>>
> >>>>>> I don't think we should shut down a node in case of
> >>>>>> IgniteOOMException. If one node has no space, the others probably
> >>>>>> don't have it either, so rebalancing will cause IgniteOOM on all
> >>>>>> other nodes and kill the whole cluster. I think that for some
> >>>>>> configurations the cluster should survive and allow the user to
> >>>>>> clean the cache and/or add more nodes.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Mikhail.
> >>>>>>
> >>>>>> On Nov 20, 2017, at 6:53 PM, "Anton Vinogradov" <avinogradov@gridgain.com> wrote:
> >>>>>>
> >>>>>>> Igniters,
> >>>>>>>
> >>>>>>> Internal problems may, and unfortunately do, cause unexpected cluster
> >>>>>>> behavior.
> >>>>>>> We should define how the cluster behaves when any internal problem
> >>>>>>> occurs.
> >>>>>>>
> >>>>>>> Well-known internal problems can be split into:
> >>>>>>> 1) OOM or any other cause of a node crash
> >>>>>>>
> >>>>>>> 2) Situations that require a graceful node shutdown with a custom
> >>>>>>> notification
> >>>>>>> - IgniteOutOfMemoryException
> >>>>>>> - Persistence errors
> >>>>>>> - ExchangeWorker exits with error
> >>>>>>>
> >>>>>>> 3) Performance issues that should be covered by metrics
> >>>>>>> - GC STW duration
> >>>>>>> - Timed-out tasks and jobs
> >>>>>>> - TX deadlock
> >>>>>>> - Hung TX (waiting for some service)
> >>>>>>> - Java deadlocks
> >>>>>>>
> >>>>>>> I created a special issue [1] to make sure all these metrics will be
> >>>>>>> presented in WebConsole or VisorConsole (which is preferred?)
> >>>>>>>
> >>>>>>> 4) Situations that require an external monitoring implementation
> >>>>>>> - GC STW duration exceeds the maximum allowed length (the node should
> >>>>>>> be stopped before the STW pause finishes)
> >>>>>>>
> >>>>>>> All these problems were reported by different people at different
> >>>>>>> times, so we should reanalyze each of them and possibly find better
> >>>>>>> ways to solve them than those described in the issues.
> >>>>>>>
> >>>>>>> P.S. IEP-7 [2] already contains 9 issues; feel free to mention
> >>>>>>> something else :)
> >>>>>>>
> >>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
> >>>>>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
> >>>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>
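
To make the API question concrete, the idea discussed in the thread (a FailureProcessingPolicy with NOOP, HALT, and RESTART reactions, optionally set per known failure and configured in a way similar to segmentation policies) could look roughly like the Java sketch below. This is only an illustration under stated assumptions: FailureKind, FailureConfig, setDefaultPolicy, setPolicy, and policyFor are hypothetical names and not an existing Ignite API. EXEC is left out because the thread suggests treating it as a separate property, and the NOOP default is an assumption as well.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Objects;

/** Reactions listed in the thread (EXEC omitted; see the note above). */
enum FailureProcessingPolicy {
    NOOP,    // report the exception and trigger metrics; do not touch the process
    HALT,    // NOOP actions, then terminate the Ignite process
    RESTART  // NOOP actions, then restart the process
}

/** Hypothetical failure kinds, taken from the examples named in the thread. */
enum FailureKind {
    OUT_OF_MEMORY,
    PERSISTENCE_ERROR,
    EXCHANGE_WORKER_ERROR
}

/** Hypothetical stand-in for the failure-related part of a node configuration. */
final class FailureConfig {
    /** Policy used when no per-failure override is set (NOOP is an assumed default). */
    private FailureProcessingPolicy defaultPolicy = FailureProcessingPolicy.NOOP;

    /** Per-failure overrides. */
    private final Map<FailureKind, FailureProcessingPolicy> overrides =
        new EnumMap<>(FailureKind.class);

    FailureConfig setDefaultPolicy(FailureProcessingPolicy policy) {
        this.defaultPolicy = Objects.requireNonNull(policy, "policy");
        return this;
    }

    FailureConfig setPolicy(FailureKind kind, FailureProcessingPolicy policy) {
        overrides.put(Objects.requireNonNull(kind, "kind"),
            Objects.requireNonNull(policy, "policy"));
        return this;
    }

    /** Returns the override for the given failure kind, or the default policy. */
    FailureProcessingPolicy policyFor(FailureKind kind) {
        return overrides.getOrDefault(kind, defaultPolicy);
    }
}

/** Usage example: halt by default, but keep the node alive on OOM. */
class FailurePolicyDemo {
    public static void main(String[] args) {
        FailureConfig cfg = new FailureConfig()
            .setDefaultPolicy(FailureProcessingPolicy.HALT)
            .setPolicy(FailureKind.OUT_OF_MEMORY, FailureProcessingPolicy.NOOP);

        for (FailureKind kind : FailureKind.values())
            System.out.println(kind + " -> " + cfg.policyFor(kind));
    }
}
```

A per-failure override keeps the cluster-survival concern raised in the thread expressible: an out-of-memory condition can be mapped to NOOP while other failures still halt the node.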
