ignite-dev mailing list archives

From Vladimir Ozerov <voze...@gridgain.com>
Subject Re: Ignite Enhancement Proposal #7 (Internal problems detection)
Date Wed, 29 Nov 2017 10:56:57 GMT
Dmitry,

Thank you, but what does FailureProcessingPolicy look like? It is not clear
how I can configure different reactions to different event types.

On Wed, Nov 29, 2017 at 1:47 PM, Дмитрий Сорокин <sbt.sorokin.dvl@gmail.com>
wrote:

> Vladimir,
>
> These policies (one policy, in fact) can be configured in IgniteConfiguration
> by calling the setFailureProcessingPolicy(FailureProcessingPolicy flrPlc)
> method.
>
> 2017-11-29 10:35 GMT+03:00 Vladimir Ozerov <vozerov@gridgain.com>:
>
> > Denis,
> >
> > Yes, but can we look at the proposed API before we dig into the implementation?
> >
> > On Tue, Nov 28, 2017 at 9:43 PM, Denis Magda <dmagda@apache.org> wrote:
> >
> > > I think the failure processing policy should be configured via
> > > IgniteConfiguration in a way similar to the segmentation policies.
> > >
> > > —
> > > Denis
> > >
> > > > > On Nov 27, 2017, at 11:28 PM, Vladimir Ozerov <vozerov@gridgain.com> wrote:
> > > >
> > > > Dmitry,
> > > >
> > > > > How will these policies be configured? Do you have any API in mind?
> > > >
> > > > > On Thu, Nov 23, 2017 at 6:26 PM, Denis Magda <dmagda@apache.org> wrote:
> > > >
> > > > >> No objections here. Additional policies like EXEC might be added later
> > > > >> depending on user needs.
> > > >>
> > > >> —
> > > >> Denis
> > > >>
> > > > >>> On Nov 23, 2017, at 2:26 AM, Дмитрий Сорокин <sbt.sorokin.dvl@gmail.com> wrote:
> > > >>>
> > > >>> Denis,
> > > > >>> I propose starting with the first three policies (they are already
> > > > >>> implemented and just await some code cleanup, commit and review).
> > > > >>> As for the fourth policy (EXEC), I think it is an additional property
> > > > >>> (a script path) rather than a policy.
> > > >>>
> > > >>> 2017-11-23 0:43 GMT+03:00 Denis Magda <dmagda@apache.org>:
> > > >>>
> > > > >>>> Just provide FailureProcessingPolicy with possible reactions:
> > > > >>>> - NOOP - exceptions will be reported and metrics will be triggered,
> > > > >>>> but the affected Ignite process won’t be touched.
> > > > >>>> - HALT (or STOP or KILL) - all the actions of NOOP + Ignite process
> > > > >>>> termination.
> > > > >>>> - RESTART - NOOP actions + process restart.
> > > > >>>> - EXEC - execute a custom script provided by the user.
> > > > >>>>
> > > > >>>> If needed, the policy can be set per known failure, such as OOM or
> > > > >>>> persistence errors, so that the user can act accordingly based on the
> > > > >>>> context.
> > > >>>>
> > > >>>> —
> > > >>>> Denis
> > > >>>>
> > > > >>>>> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <vozerov@gridgain.com> wrote:
> > > >>>>>
> > > > >>>>> In the first iteration I would focus only on reporting facilities,
> > > > >>>>> to let the administrator spot dangerous situations. In the second
> > > > >>>>> phase, when all reporting and metrics are ready, we can think about
> > > > >>>>> automatic actions.
> > > >>>>>
> > > > >>>>> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <mcherkasov@gridgain.com> wrote:
> > > >>>>>
> > > >>>>>> Hi Anton,
> > > >>>>>>
> > > > >>>>>> I don't think that we should shut down a node in case of
> > > > >>>>>> IgniteOOMException. If one node has no space, then the other nodes
> > > > >>>>>> probably don't have it either, so rebalancing will cause IgniteOOM
> > > > >>>>>> on all other nodes and will kill the whole cluster. I think for some
> > > > >>>>>> configurations the cluster should survive and allow the user to
> > > > >>>>>> clean the cache and/or add more nodes.
> > > >>>>>>
> > > >>>>>> Thanks,
> > > >>>>>> Mikhail.
> > > >>>>>>
> > > >>>>>> 20 нояб. 2017 г. 6:53 ПП пользователь
"Anton Vinogradov" <
> > > >>>>>> avinogradov@gridgain.com> написал:
> > > >>>>>>
> > > >>>>>>> Igniters,
> > > >>>>>>>
> > > > >>>>>>> Internal problems may, and unfortunately do, cause unexpected
> > > > >>>>>>> cluster behavior.
> > > > >>>>>>> We should define the behavior for the case when any such internal
> > > > >>>>>>> problem occurs.
> > > >>>>>>>
> > > > >>>>>>> Well-known internal problems can be split into:
> > > > >>>>>>> 1) OOM or any other reason that causes a node crash
> > > > >>>>>>>
> > > > >>>>>>> 2) Situations that require graceful node shutdown with a custom
> > > > >>>>>>> notification
> > > >>>>>>> - IgniteOutOfMemoryException
> > > >>>>>>> - Persistence errors
> > > >>>>>>> - ExchangeWorker exits with error
> > > >>>>>>>
> > > > >>>>>>> 3) Performance issues that should be covered by metrics
> > > >>>>>>> - GC STW duration
> > > >>>>>>> - Timed out tasks and jobs
> > > >>>>>>> - TX deadlock
> > > > >>>>>>> - Hung TX (waits for some service)
> > > >>>>>>> - Java Deadlocks
> > > >>>>>>>
> > > > >>>>>>> I created a special issue [1] to make sure all these metrics will
> > > > >>>>>>> be presented in WebConsole or VisorConsole (which is preferred?)
> > > >>>>>>>
> > > > >>>>>>> 4) Situations that require an external monitoring implementation
> > > > >>>>>>> - GC STW duration exceeds the maximum allowed length (the node
> > > > >>>>>>> should be stopped before the STW finishes)
> > > >>>>>>>
> > > > >>>>>>> All these problems were reported by different people at different
> > > > >>>>>>> times, so we should reanalyze each of them and, possibly, find
> > > > >>>>>>> better ways to solve them than those described in the issues.
> > > >>>>>>>
> > > > >>>>>>> P.S. IEP-7 [2] already contains 9 issues, feel free to mention
> > > > >>>>>>> something else :)
> > > >>>>>>>
> > > >>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
> > > >>>>>>> [2]
> > > >>>>>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> > > >>>>>>> 7%3A+Ignite+internal+problems+detection
> > > >>>>>>>
> > > >>>>>>
> > > >>>>
> > > >>>>
> > > >>
> > > >>
> > >
> > >
> >
>
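
For illustration only: the thread asks how different reactions could be
configured for different failure types, so here is a minimal standalone
sketch of one possible shape. Everything in it is an assumption made for
this example and not the API proposed in IEP-7 or shipped in Ignite: the
FailureType and Reaction enums, the on(...) and reactionFor(...) methods,
and the class itself are hypothetical. Only the reaction names NOOP, HALT
and RESTART and the failure kinds come from the messages above.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch; none of these types are part of Apache Ignite. */
final class FailurePolicySketch {
    /** Failure kinds named in the thread (enum itself is an assumption). */
    enum FailureType { OUT_OF_MEMORY, PERSISTENCE_ERROR, EXCHANGE_WORKER_EXIT }

    /** Reactions listed in the thread; EXEC is left out, as proposed there. */
    enum Reaction { NOOP, HALT, RESTART }

    /** A default reaction plus optional per-failure-type overrides. */
    static final class FailureProcessingPolicy {
        private final Reaction defaultReaction;
        private final Map<FailureType, Reaction> overrides =
            new EnumMap<>(FailureType.class);

        FailureProcessingPolicy(Reaction defaultReaction) {
            this.defaultReaction = defaultReaction;
        }

        /** Registers a reaction for one failure type; returns this for chaining. */
        FailureProcessingPolicy on(FailureType type, Reaction reaction) {
            overrides.put(type, reaction);
            return this;
        }

        /** Returns the override for the type, or the default if none is set. */
        Reaction reactionFor(FailureType type) {
            return overrides.getOrDefault(type, defaultReaction);
        }
    }

    public static void main(String[] args) {
        FailureProcessingPolicy policy =
            new FailureProcessingPolicy(Reaction.NOOP)
                .on(FailureType.PERSISTENCE_ERROR, Reaction.HALT)
                .on(FailureType.EXCHANGE_WORKER_EXIT, Reaction.RESTART);

        // No override for OUT_OF_MEMORY, so the default (NOOP) applies.
        System.out.println(policy.reactionFor(FailureType.OUT_OF_MEMORY));
        System.out.println(policy.reactionFor(FailureType.PERSISTENCE_ERROR));
    }
}
```

In this sketch a default of NOOP with explicit overrides keeps the node
alive on IgniteOOM-like failures, which matches the concern raised in the
thread about one node's shutdown cascading through the cluster.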
