ignite-dev mailing list archives

From Сорокин Дмитрий Владимирович <DVlSorokin....@sberbank.ru>
Subject Re: [!!Mass Mail]Re: Ignite Enhancement Proposal #7 (Internal problems detection)
Date Tue, 28 Nov 2017 13:30:20 GMT
Vladimir,

This policy (there is in fact just one) can be configured in IgniteConfiguration by calling the setFailureProcessingPolicy(FailureProcessingPolicy flrPlc) method.
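
To make the proposal concrete, here is a minimal, self-contained sketch of how such a setter could look. It is only an illustration under stated assumptions: NodeConfig is a hypothetical stand-in for IgniteConfiguration, the enum constants follow the reactions listed later in this thread (with HALT standing for the terminate reaction), the NOOP default is assumed, and the action names are made up. It is not the actual Ignite API.

```java
import java.util.ArrayList;
import java.util.List;

public class FailurePolicySketch {
    /** Reactions proposed in this thread; EXEC is left out, as suggested below. */
    enum FailureProcessingPolicy { NOOP, HALT, RESTART }

    /** Hypothetical stand-in for IgniteConfiguration (not the real class). */
    static class NodeConfig {
        // Assumption: NOOP is the default when nothing is configured.
        private FailureProcessingPolicy flrPlc = FailureProcessingPolicy.NOOP;

        NodeConfig setFailureProcessingPolicy(FailureProcessingPolicy flrPlc) {
            this.flrPlc = flrPlc;
            return this; // fluent style, so calls can be chained
        }

        FailureProcessingPolicy getFailureProcessingPolicy() {
            return flrPlc;
        }
    }

    /** Maps a policy to the actions a node would take on an internal failure. */
    static List<String> actionsFor(FailureProcessingPolicy plc) {
        List<String> actions = new ArrayList<>();
        // Every policy includes the NOOP actions: reporting and metrics.
        actions.add("report");
        actions.add("metrics");
        switch (plc) {
            case HALT:
                actions.add("terminate");
                break;
            case RESTART:
                actions.add("restart");
                break;
            case NOOP:
                break;
        }
        return actions;
    }

    public static void main(String[] args) {
        NodeConfig cfg = new NodeConfig()
            .setFailureProcessingPolicy(FailureProcessingPolicy.HALT);
        System.out.println(actionsFor(cfg.getFailureProcessingPolicy()));
    }
}
```

Modelling HALT and RESTART as "NOOP actions plus one more step" keeps reporting and metrics unconditional, which matches the way the reactions are described in the quoted messages.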

--
Дмитрий Сорокин
Tel.: 8-789-13512
Mobile: +7 (916) 560-39-63


On 28.11.17, 10:28, "Vladimir Ozerov" <vozerov@gridgain.com> wrote:

    Dmitry,

    How will these policies be configured? Do you have any API in mind?

    On Thu, Nov 23, 2017 at 6:26 PM, Denis Magda <dmagda@apache.org> wrote:

    > No objections here. Additional policies like EXEC might be added later
    > depending on user needs.
    >
    > —
    > Denis
    >
    > > On Nov 23, 2017, at 2:26 AM, Дмитрий Сорокин <sbt.sorokin.dvl@gmail.com> wrote:
    > >
    > > Denis,
    > > I propose starting with the first three policies (they are already
    > > implemented and just await some code cleanup, commit & review).
    > > As for the fourth policy (EXEC), I think it is an additional property
    > > (a script path) rather than a policy.
    > >
    > > 2017-11-23 0:43 GMT+03:00 Denis Magda <dmagda@apache.org>:
    > >
    > >> Just provide FailureProcessingPolicy with possible reactions:
    > >> - NOOP - exceptions will be reported, metrics will be triggered but an
    > >> affected Ignite process won’t be touched.
    > >> - HALT (or STOP or KILL) - all the NOOP actions + Ignite process
    > >> termination.
    > >> - RESTART - NOOP actions + process restart.
    > >> - EXEC - execute a custom script provided by the user.
    > >>
    > >> If needed, the policy can be set per known failure, such as OOM or
    > >> persistence errors, so that the user can act according to the context.
    > >>
    > >> —
    > >> Denis
    > >>
    > >>> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <vozerov@gridgain.com> wrote:
    > >>>
    > >>> In the first iteration I would focus only on reporting facilities, to let
    > >>> the administrator spot dangerous situations. In the second phase, when all
    > >>> reporting and metrics are ready, we can think about automatic actions.
    > >>>
    > >>> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <mcherkasov@gridgain.com> wrote:
    > >>>
    > >>>> Hi Anton,
    > >>>>
    > >>>> I don't think we should shut down a node in case of
    > >>>> IgniteOOMException. If one node has no space, the others probably
    > >>>> don't have it either, so rebalancing will cause IgniteOOM on all
    > >>>> other nodes and kill the whole cluster. I think that for some
    > >>>> configurations the cluster should survive and allow the user to
    > >>>> clean the cache and/or add more nodes.
    > >>>>
    > >>>> Thanks,
    > >>>> Mikhail.
    > >>>>
    > >>>> On Nov 20, 2017, 6:53 PM, "Anton Vinogradov" <avinogradov@gridgain.com> wrote:
    > >>>>
    > >>>>> Igniters,
    > >>>>>
    > >>>>> Internal problems may, and unfortunately do, cause unexpected cluster
    > >>>>> behavior.
    > >>>>> We should define the behavior for each kind of internal problem.
    > >>>>>
    > >>>>> Well-known internal problems can be split into:
    > >>>>> 1) OOM or any other reason that causes a node crash
    > >>>>>
    > >>>>> 2) Situations that require graceful node shutdown with a custom
    > >>>>> notification
    > >>>>> - IgniteOutOfMemoryException
    > >>>>> - Persistence errors
    > >>>>> - ExchangeWorker exits with error
    > >>>>>
    > >>>>> 3) Performance issues that should be covered by metrics
    > >>>>> - GC STW duration
    > >>>>> - Timed out tasks and jobs
    > >>>>> - TX deadlock
    > >>>>> - Hung Tx (waits for some service)
    > >>>>> - Java Deadlocks
    > >>>>>
    > >>>>> I created a special issue [1] to make sure all these metrics will be
    > >>>>> presented in WebConsole or VisorConsole (which is preferred?)
    > >>>>>
    > >>>>> 4) Situations that require an external monitoring implementation
    > >>>>> - GC STW duration exceeds the maximum allowed length (the node should
    > >>>>> be stopped before the STW finishes)
    > >>>>>
    > >>>>> All these problems were reported by different people at different
    > >>>>> times, so we should reanalyze each of them and, possibly, find better
    > >>>>> ways to solve them than those described in the issues.
    > >>>>>
    > >>>>> P.S. IEP-7 [2] already contains 9 issues, feel free to mention
    > >>>>> something else :)
    > >>>>>
    > >>>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
    > >>>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
    > >>>>>
    > >>>>
    > >>
    > >>
    >
    >

