ignite-dev mailing list archives

From Vladimir Ozerov <voze...@gridgain.com>
Subject Re: Ignite Enhancement Proposal #7 (Internal problems detection)
Date Tue, 28 Nov 2017 07:28:56 GMT
Dmitry,

How will these policies be configured? Do you have any API in mind?

On Thu, Nov 23, 2017 at 6:26 PM, Denis Magda <dmagda@apache.org> wrote:

> No objections here. Additional policies like EXEC might be added later
> depending on user needs.
>
> —
> Denis
>
> > On Nov 23, 2017, at 2:26 AM, Дмитрий Сорокин <sbt.sorokin.dvl@gmail.com> wrote:
> >
> > Denis,
> > I propose starting with the first three policies (they are already implemented
> > and only await some code cleanup, commit, and review).
> > As for the fourth policy (EXEC), I think it is an additional property
> > (a script path) rather than a policy.
> >
> > 2017-11-23 0:43 GMT+03:00 Denis Magda <dmagda@apache.org>:
> >
> >> Just provide FailureProcessingPolicy with possible reactions:
> >> - NOOP - exceptions will be reported and metrics will be triggered, but the
> >> affected Ignite process won’t be touched.
> >> - HALT (or STOP or KILL) - all the actions of NOOP + Ignite
> >> process termination.
> >> - RESTART - NOOP actions + process restart.
> >> - EXEC - execute a custom script provided by the user.
> >>
> >> If needed, the policy can be set per known failure, such as OOM or
> >> persistence errors, so that the user can act according to the context.
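
To make the configuration question concrete, here is a minimal Java sketch of how the first three reactions and a per-failure override might be modeled. This is only an illustration, not an existing Ignite API: FailureProcessingPolicy is the name proposed above, while FailureType, FailurePolicyConfig, setPolicy and policyFor are hypothetical names invented for this example. EXEC is left out because, as noted above, it may turn out to be a property rather than a policy.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical failure categories, taken from the problems listed in this thread.
enum FailureType { OUT_OF_MEMORY, PERSISTENCE_ERROR, EXCHANGE_WORKER_FAILURE }

// The three reactions discussed above (HALT is the process-termination one).
enum FailureProcessingPolicy { NOOP, HALT, RESTART }

// Hypothetical configuration holder: one default policy plus per-failure overrides.
final class FailurePolicyConfig {
    private final FailureProcessingPolicy defaultPolicy;
    private final Map<FailureType, FailureProcessingPolicy> overrides =
        new EnumMap<>(FailureType.class);

    FailurePolicyConfig(FailureProcessingPolicy defaultPolicy) {
        this.defaultPolicy = defaultPolicy;
    }

    // Sets the policy for one known failure type; returns this for chaining.
    FailurePolicyConfig setPolicy(FailureType type, FailureProcessingPolicy policy) {
        overrides.put(type, policy);
        return this;
    }

    // Returns the override for the given failure type, or the default policy.
    FailureProcessingPolicy policyFor(FailureType type) {
        return overrides.getOrDefault(type, defaultPolicy);
    }
}
```

With this shape, a user who wants the cluster to survive OOM but stop on everything else would write `new FailurePolicyConfig(FailureProcessingPolicy.HALT).setPolicy(FailureType.OUT_OF_MEMORY, FailureProcessingPolicy.NOOP)`.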
> >>
> >> —
> >> Denis
> >>
> >>> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <vozerov@gridgain.com> wrote:
> >>>
> >>> In the first iteration I would focus only on reporting facilities, to let
> >>> administrators spot dangerous situations. In the second phase, when all
> >>> reporting and metrics are ready, we can think about automatic actions.
> >>>
> >>> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <mcherkasov@gridgain.com> wrote:
> >>>
> >>>> Hi Anton,
> >>>>
> >>>> I don't think we should shut down a node in case of IgniteOOMException:
> >>>> if one node has no space, the others probably don't have it either, so
> >>>> rebalancing will cause IgniteOOM on all the other nodes and kill the whole
> >>>> cluster. I think that for some configurations the cluster should survive
> >>>> and allow the user to clean the cache and/or add more nodes.
> >>>>
> >>>> Thanks,
> >>>> Mikhail.
> >>>>
> >>>> On Nov 20, 2017 at 6:53 PM, "Anton Vinogradov" <avinogradov@gridgain.com> wrote:
> >>>>
> >>>>> Igniters,
> >>>>>
> >>>>> Internal problems may, and unfortunately do, cause unexpected cluster
> >>>>> behavior.
> >>>>> We should define the behavior for each kind of internal problem.
> >>>>>
> >>>>> Well-known internal problems can be split into:
> >>>>> 1) OOM or any other cause of a node crash
> >>>>>
> >>>>> 2) Situations that require graceful node shutdown with a custom notification
> >>>>> - IgniteOutOfMemoryException
> >>>>> - Persistence errors
> >>>>> - ExchangeWorker exits with error
> >>>>>
> >>>>> 3) Performance issues that should be covered by metrics
> >>>>> - GC STW duration
> >>>>> - Timed out tasks and jobs
> >>>>> - TX deadlock
> >>>>> - Hung Tx (waits for some service)
> >>>>> - Java Deadlocks
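
As an illustration of the last item in this list, Java-level deadlocks can be detected with the standard java.lang.management API. The sketch below is only an example of the technique, not Ignite code: the DeadlockProbe class and its method are hypothetical names, and how the result would be published as a metric is left open.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: reports how many threads are currently deadlocked.
final class DeadlockProbe {
    static int deadlockedThreadCount() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();

        // findDeadlockedThreads() also covers java.util.concurrent locks, but it
        // needs synchronizer usage monitoring; otherwise fall back to monitors only.
        long[] ids = bean.isSynchronizerUsageSupported()
            ? bean.findDeadlockedThreads()
            : bean.findMonitorDeadlockedThreads();

        // Both methods return null when no deadlock is found.
        return ids == null ? 0 : ids.length;
    }
}
```

A metric built on this would typically be sampled periodically, since the scan walks all live threads and is not free.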
> >>>>>
> >>>>> I created a special issue [1] to make sure all these metrics will be
> >>>>> presented in WebConsole or VisorConsole (which is preferred?)
> >>>>>
> >>>>> 4) Situations that require an external monitoring implementation
> >>>>> - GC STW duration exceeds the maximum allowed length (the node should be
> >>>>> stopped before the STW pause finishes)
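
A node cannot act on its own STW pause while it is paused, which is why this case needs an external observer. One possible approach, shown here only as a sketch under assumptions, is a heartbeat check: the node records a timestamp regularly, and a separate process treats a long silence as a sign of a pause. HeartbeatWatchdog and isStale are hypothetical names, and how the heartbeat is transported (file, socket, JMX) and how the node is stopped are deliberately left out.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical external watchdog: decides whether a node has been silent too long.
final class HeartbeatWatchdog {
    private final Duration maxSilence;

    HeartbeatWatchdog(Duration maxSilence) {
        this.maxSilence = maxSilence;
    }

    // True when the gap between the last heartbeat and now exceeds the limit.
    boolean isStale(Instant lastHeartbeat, Instant now) {
        return Duration.between(lastHeartbeat, now).compareTo(maxSilence) > 0;
    }
}
```

Note that a stale heartbeat does not prove a GC pause; it only says the node stopped reporting, so the limit has to be chosen with other causes (I/O stalls, CPU starvation) in mind.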
> >>>>>
> >>>>> All these problems were reported by different people at different times,
> >>>>> so we should reanalyze each of them and, possibly, find better ways to
> >>>>> solve them than those described in the issues.
> >>>>>
> >>>>> P.S. IEP-7 [2] already contains 9 issues; feel free to mention something
> >>>>> else :)
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
> >>>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
> >>>>>
> >>>>
> >>
> >>
>
>
