ignite-dev mailing list archives

From Andrey Kuznetsov <stku...@gmail.com>
Subject Re: Critical worker threads liveness checking drawbacks
Date Fri, 12 Oct 2018 05:29:04 GMT
Igniters,

I now see blocking / long-running code arising from
{{GridDhtPartitionsExchangeFuture#init}} calls in the partition-exchanger
thread, see [1]. Ideally, all blocking operations along all possible code
paths should be guarded, so that the critical failure detector does not
consider the thread blocked. There is a pull request [2] that provides a
shallow solution. I didn't change code outside
{{GridDhtPartitionsExchangeFuture}}, otherwise it could be broken by any
upcoming change. Also, I didn't touch code run by threads other than the
partition-exchanger. So I have a number of guarded sections that are
wider than they could be, and this potentially hides issues from the
failure detector. Does this PR make sense? Or maybe it's better to exclude
the partition-exchanger from the critical threads registry altogether?

[1] https://issues.apache.org/jira/browse/IGNITE-9710
[2] https://github.com/apache/ignite/pull/4962
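To make the discussion concrete for readers outside the Ignite internals, here is a minimal, self-contained sketch of the guard semantics this thread is about. Only the method names blockingSectionBegin/blockingSectionEnd come from the thread above; the WatchedWorker class, the heartbeat mechanics, and the timings are invented for illustration and are not Ignite code.

```java
// Toy model of a critical-worker liveness check with blocking-section guards.
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

class WatchedWorker {
    private final AtomicLong heartbeatTs = new AtomicLong(System.currentTimeMillis());
    private final AtomicBoolean inBlockingSection = new AtomicBoolean(false);

    // Marks the start of code that may legitimately block: the failure
    // detector must not treat the pause as a hang.
    void blockingSectionBegin() { inBlockingSection.set(true); }

    // Marks the end of the blocking code; refreshes the heartbeat so the
    // detector restarts its countdown from "now".
    void blockingSectionEnd() {
        heartbeatTs.set(System.currentTimeMillis());
        inBlockingSection.set(false);
    }

    // Watchdog view: a worker counts as blocked only if it is NOT inside a
    // declared blocking section and its heartbeat is older than the timeout.
    boolean isConsideredBlocked(long timeoutMs) {
        return !inBlockingSection.get()
            && System.currentTimeMillis() - heartbeatTs.get() > timeoutMs;
    }
}

public class Main {
    public static void main(String[] args) throws Exception {
        WatchedWorker w = new WatchedWorker();

        w.blockingSectionBegin();
        Thread.sleep(50);                               // long operation, guarded
        System.out.println(w.isConsideredBlocked(10));  // false: inside a blocking section
        w.blockingSectionEnd();

        Thread.sleep(50);                               // same pause, unguarded
        System.out.println(w.isConsideredBlocked(10));  // true: looks hung to the watchdog
    }
}
```

The trade-off Andrey describes follows directly from this model: a guarded section that is wider than the truly blocking call keeps the flag set longer than necessary, so a genuine hang inside it would be invisible to the detector.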


On Fri, Sep 28, 2018 at 18:56, Maxim Muzafarov <maxmuzaf@gmail.com> wrote:

> Andrey, Andrey
>
> > Thanks for being attentive! It's definitely a typo. Could you please
> > create an issue?
>
> I've created an issue [1] and prepared PR [2].
> Please, review this change.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-9723
> [2] https://github.com/apache/ignite/pull/4862
>
> On Fri, 28 Sep 2018 at 16:58 Yakov Zhdanov <yzhdanov@apache.org> wrote:
>
> > Config option + mbean access. Does that make sense?
> >
> > Yakov
> >
> > On Fri, Sep 28, 2018 at 17:17 Vladimir Ozerov <vozerov@gridgain.com>
> > wrote:
> >
> > > Then it should be config option.
> > >
> > > On Fri, Sep 28, 2018 at 13:15, Andrey Gura <agura@apache.org> wrote:
> > >
> > > > Guys,
> > > >
> > > > Why do we need both a config option and a system property? I believe
> > > > one way is enough.
> > > > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov <nizhikov@apache.org>
> > > > wrote:
> > > > >
> > > > > Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737
> > > > >
> > > > > The fix version is 2.7.
> > > > >
> > > > > On Fri, Sep 28, 2018 at 11:41 +0300, Alexey Goncharuk wrote:
> > > > > > Nikolay, I agree, a user should be able to disable both the thread
> > > > > > liveness check and the checkpoint read lock timeout check from the
> > > > > > config and a system property.
> > > > > >
> > > > > > On Fri, Sep 28, 2018 at 11:30, Nikolay Izhikov <nizhikov@apache.org> wrote:
> > > > > >
> > > > > > > Hello, Igniters.
> > > > > > >
> > > > > > > I found that this feature can't be disabled from the config.
> > > > > > > The only way to disable it is via the JMX bean.
> > > > > > >
> > > > > > > I think it is very dangerous: if we have some corner case or a bug
> > > > > > > in this watchdog, it can make Ignite unusable.
> > > > > > > I propose to implement the possibility to disable this feature both
> > > > > > > from the config and from JVM options.
> > > > > > >
> > > > > > > What do you think?
> > > > > > >
> > > > > > > On Thu, Sep 27, 2018 at 16:14 +0300, Andrey Kuznetsov wrote:
> > > > > > > > Maxim,
> > > > > > > >
> > > > > > > > Thanks for being attentive! It's definitely a typo. Could you
> > > > > > > > please create an issue?
> > > > > > > >
> > > > > > > > On Thu, Sep 27, 2018 at 16:00, Maxim Muzafarov <maxmuzaf@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Folks,
> > > > > > > > >
> > > > > > > > > I've found in `GridCachePartitionExchangeManager:2684` [1] (master
> > > > > > > > > branch) an exchange future wrapped with a double `blockingSectionEnd`
> > > > > > > > > call. Is it correct? I just want to understand this change and how I
> > > > > > > > > should use it in the future.
> > > > > > > > >
> > > > > > > > > Should I file a new issue to fix this? I think the
> > > > > > > > > `blockingSectionBegin` method should be used here.
> > > > > > > > >
> > > > > > > > > -------------
> > > > > > > > > blockingSectionEnd();
> > > > > > > > >
> > > > > > > > > try {
> > > > > > > > >     resVer = exchFut.get(exchTimeout,
> TimeUnit.MILLISECONDS);
> > > > > > > > > } finally {
> > > > > > > > >     blockingSectionEnd();
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > >
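For clarity, the corrected idiom Maxim points at can be sketched as follows. The names blockingSectionBegin/blockingSectionEnd are the ones from this thread; the awaitExchange stand-in for exchFut.get(...) and everything else is hypothetical, not Ignite code.

```java
// The fix under discussion: the call BEFORE the blocking wait should be
// blockingSectionBegin(), not a second blockingSectionEnd().
import java.util.concurrent.atomic.AtomicBoolean;

public class Main {
    static final AtomicBoolean inBlockingSection = new AtomicBoolean(false);

    static void blockingSectionBegin() { inBlockingSection.set(true); }
    static void blockingSectionEnd() { inBlockingSection.set(false); }

    // Stand-in for exchFut.get(exchTimeout, MILLISECONDS): a wait that may block.
    static String awaitExchange() throws InterruptedException {
        Thread.sleep(10);
        return "topVer-1";
    }

    public static void main(String[] args) throws InterruptedException {
        String resVer;
        blockingSectionBegin();          // begin, not end: declare the legal wait
        try {
            resVer = awaitExchange();    // watchdog ignores this worker meanwhile
        } finally {
            blockingSectionEnd();        // always unmark, even on timeout/error
        }
        System.out.println(resVer + " " + inBlockingSection.get());
    }
}
```

With the typo as committed, the flag is never set before the wait, so the watchdog can flag the exchange worker as blocked during a perfectly normal long exchange.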
> > > > > > > > > [1]
> > > > > > > > > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > > > > > > > >
> > > > > > > > > On Wed, Sep 26, 2018 at 22:47, Vyacheslav Daradur <daradurvs@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Andrey Gura, thank you for the answer!
> > > > > > > > > >
> > > > > > > > > > I agree that wrapping the 'init' method reduces the profit of the
> > > > > > > > > > watchdog service in the case of the PME worker, but in other cases
> > > > > > > > > > we should wrap all possible long sections in
> > > > > > > > > > GridDhtPartitionsExchangeFuture, for example the
> > > > > > > > > > 'onCacheChangeRequest' method or
> > > > > > > > > > 'cctx.affinity().onCacheChangeRequest' inside it, because it may
> > > > > > > > > > take significant time (reproducer attached).
> > > > > > > > > >
> > > > > > > > > > I only want to point out a possible issue which may allow an end
> > > > > > > > > > user to halt the Ignite cluster accidentally.
> > > > > > > > > >
> > > > > > > > > > I'm sure that PME experts know how to fix this issue properly.
> > > > > > > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <agura@apache.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Vyacheslav,
> > > > > > > > > > >
> > > > > > > > > > > The exchange worker is strongly tied to
> > > > > > > > > > > GridDhtPartitionsExchangeFuture#init, and that is ok. The exchange
> > > > > > > > > > > worker also shouldn't be blocked for a long time, but in reality it
> > > > > > > > > > > happens. It also means that your change doesn't make sense.
> > > > > > > > > > >
> > > > > > > > > > > What actually makes sense is identification of the places which are
> > > > > > > > > > > intentionally blocking. Maybe some places/actions should be braced
> > > > > > > > > > > by blocking guards.
> > > > > > > > > > >
> > > > > > > > > > > If you have failing tests, please make sure that your failureHandler
> > > > > > > > > > > is NoOpFailureHandler or any other handler with ignoreFailureTypes =
> > > > > > > > > > > [CRITICAL_WORKER_BLOCKED].
> > > > > > > > > > >
> > > > > > > > > > >
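The handler setup Andrey describes can be sketched as a configuration fragment. The class and enum names below follow Apache Ignite's public failure-handling API as I recall it (org.apache.ignite.failure.*); note that the thread uses the name CRITICAL_WORKER_BLOCKED while released versions expose SYSTEM_WORKER_BLOCKED, so verify the exact names against your Ignite version.

```java
// Configuration sketch: ignore blocked-worker failures in tests so a slow
// but alive node is not stopped by the liveness watchdog.
import java.util.Collections;

import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.NoOpFailureHandler;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class FailureHandlerConfig {
    public static IgniteConfiguration cfg(boolean testMode) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        if (testMode) {
            // Option 1: never act on failures at all (tests only).
            cfg.setFailureHandler(new NoOpFailureHandler());
        }
        else {
            // Option 2: keep stopping the node on real failures, but ignore
            // the blocked-critical-worker type discussed in this thread.
            StopNodeOrHaltFailureHandler h = new StopNodeOrHaltFailureHandler();
            h.setIgnoredFailureTypes(
                Collections.singleton(FailureType.SYSTEM_WORKER_BLOCKED));
            cfg.setFailureHandler(h);
        }

        return cfg;
    }
}
```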
> > > > > > > > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur
> > > > > > > > > > > <daradurvs@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Igniters!
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you for this important improvement!
> > > > > > > > > > > >
> > > > > > > > > > > > I've looked through the implementation and noticed that
> > > > > > > > > > > > GridDhtPartitionsExchangeFuture#init has not been wrapped in a
> > > > > > > > > > > > blocked section. This means it is easy to halt the node in case of
> > > > > > > > > > > > long-running actions during PME, for example when we create a cache
> > > > > > > > > > > > with a StoreFactory which connects to a 3rd-party DB.
> > > > > > > > > > > >
> > > > > > > > > > > > I'm not sure that this is the right behavior.
> > > > > > > > > > > >
> > > > > > > > > > > > I filed the issue [1] and prepared the PR [2] with a reproducer and
> > > > > > > > > > > > a possible fix.
> > > > > > > > > > > >
> > > > > > > > > > > > Andrey, could you please take a look and confirm that it makes sense?
> > > > > > > > > > > >
> > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > > > > > > > > > > [2] https://github.com/apache/ignite/pull/4845
> > > > > > > > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <stkuzma@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Denis,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I've created the ticket [1] with a short description of the
> > > > > > > > > > > > > functionality.
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Sep 24, 2018 at 17:46, Denis Magda <dmagda@apache.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Andrey K. and G.,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks, do we have a documentation ticket created? Prachi (copied)
> > > > > > > > > > > > > > can help with the documentation.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Denis
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <agura@apache.org>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Andrey,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Finally, your change is merged to the master branch.
> > > > > > > > > > > > > > > Congratulations and thank you very much! :)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I think the next step is a feature that will allow signaling about
> > > > > > > > > > > > > > > blocked threads to the monitoring tools via an MXBean.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I hope you will continue development of this feature and provide
> > > > > > > > > > > > > > > your vision in a new JIRA issue.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <stkuzma@gmail.com>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > David, Maxim!
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks a lot for your ideas. Unfortunately, I can't adopt all of
> > > > > > > > > > > > > > > > them right now: the scope is much broader than the scope of the
> > > > > > > > > > > > > > > > change I implement. I have had a talk with a group of Ignite
> > > > > > > > > > > > > > > > committers, and we agreed to complete the change as follows.
> > > > > > > > > > > > > > > > - Blocking instructions in system-critical threads which may
> > > > > > > > > > > > > > > > reasonably last long should be explicitly excluded from the
> > > > > > > > > > > > > > > > monitoring.
> > > > > > > > > > > > > > > > - Failure handlers should have a setting to suppress some failures
> > > > > > > > > > > > > > > > on a per-failure-type basis.
> > > > > > > > > > > > > > > > According to this, I have updated the implementation: [1]
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > [1] https://github.com/apache/ignite/pull/4089
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Mon, Sep 10, 2018 at 22:35, David Harvey <syssoftsol@gmail.com> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > When I've done this before, I've needed to find the oldest
> > > > > > > > > > > > > > > > > thread, and kill the node running that. From a language
> > > > > > > > > > > > > > > > > standpoint, Maxim's "without progress" is better than
> > > > > > > > > > > > > > > > > "heartbeat". For example, what I'm most interested in on a
> > > > > > > > > > > > > > > > > distributed system is which thread started the work it has not
> > > > > > > > > > > > > > > > > completed the earliest, and when that thread last made forward
> > > > > > > > > > > > > > > > > progress. You don't want to kill a node because a thread is
> > > > > > > > > > > > > > > > > waiting on a lock held by a thread that went off-node and has
> > > > > > > > > > > > > > > > > not gotten a response. If you don't understand the dependency
> > > > > > > > > > > > > > > > > relationships, you will make incorrect recovery decisions.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov
> > > > > > > > > > > > > > > > > <maxmuzaf@gmail.com> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I think we should find exact answers to these questions:
> > > > > > > > > > > > > > > > > >  1. What exactly is a `critical` issue?
> > > > > > > > > > > > > > > > > >  2. How can we find critical issues?
> > > > > > > > > > > > > > > > > >  3. How can we handle critical issues?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > First,
> > > > > > > > > > > > > > > > > >  - Ignore uninterruptible actions (e.g. worker\service shutdown)
> > > > > > > > > > > > > > > > > >  - Long I/O operations (there should be a configurable timeout
> > > > > > > > > > > > > > > > > >    for each type of usage)
> > > > > > > > > > > > > > > > > >  - Infinite loops
> > > > > > > > > > > > > > > > > >  - Stalled\deadlocked threads (and\or too many parked threads,
> > > > > > > > > > > > > > > > > >    excluding I/O)
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Second,
> > > > > > > > > > > > > > > > > >  - The working queue is without progress (e.g. disco, exchange
> > > > > > > > > > > > > > > > > >    queues)
> > > > > > > > > > > > > > > > > >  - Work hasn't been completed since the last heartbeat (checking
> > > > > > > > > > > > > > > > > >    milestones)
> > > > > > > > > > > > > > > > > >  - Too many system resources used by a thread for a long period
> > > > > > > > > > > > > > > > > >    of time (allocated memory, CPU)
> > > > > > > > > > > > > > > > > >  - Timing fields associated with each thread status exceeded a
> > > > > > > > > > > > > > > > > >    maximum time limit.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Third (not too many options here),
> > > > > > > > > > > > > > > > > >  - `log everything` should be the default behaviour in all these
> > > > > > > > > > > > > > > > > >    cases, since it may be difficult to find the cause after the
> > > > > > > > > > > > > > > > > >    restart.
> > > > > > > > > > > > > > > > > >  - Wait some interval of time and kill the hanging node (the
> > > > > > > > > > > > > > > > > >    cluster should be configured to be stable enough)
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Questions,
> > > > > > > > > > > > > > > > > >  - Not sure, but can workers miss their heartbeat deadlines if
> > > > > > > > > > > > > > > > > >    CPU load is up to 80%-90%? Bursts of momentary overloads can
> > > > > > > > > > > > > > > > > >    be expected behaviour as a normal part of system operations.
> > > > > > > > > > > > > > > > > >  - Why did we decide that critical threads should monitor each
> > > > > > > > > > > > > > > > > >    other? For instance, if all the tasks were blocked and unable
> > > > > > > > > > > > > > > > > >    to run, node reset would never occur. As for me, a better
> > > > > > > > > > > > > > > > > >    solution is to use a separate monitor thread or pool (maybe
> > > > > > > > > > > > > > > > > >    both with software and hardware checks) that not only checks
> > > > > > > > > > > > > > > > > >    heartbeats but monitors the other systems as well.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Mon, Sep 10, 2018 at 00:07, David Harvey <syssoftsol@gmail.com>
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > It would be safer to restart the entire cluster than to remove
> > > > > > > > > > > > > > > > > > > the last node for a cache that should be redundant.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Sun, Sep 9, 2018 at 4:00 PM Andrey Gura <agura@apache.org>
> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > I agree with Yakov that we can provide some option that manages
> > > > > > > > > > > > > > > > > > > > the worker liveness checker behavior when it observes that some
> > > > > > > > > > > > > > > > > > > > worker has been blocked for too long. At least it will be some
> > > > > > > > > > > > > > > > > > > > workaround for the cases when node failure is too annoying.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > The backups count threshold sounds good, but I don't understand
> > > > > > > > > > > > > > > > > > > > how it will help in case of cluster hanging.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > The simplest solution here is an alert in case some critical
> > > > > > > > > > > > > > > > > > > > worker is blocked (we can improve WorkersRegistry for this
> > > > > > > > > > > > > > > > > > > > purpose and expose a list of blocked workers) and optionally a
> > > > > > > > > > > > > > > > > > > > call to the system-configured failure processor. BTW, the
> > > > > > > > > > > > > > > > > > > > failure processor can be extended in order to perform any checks
> > > > > > > > > > > > > > > > > > > > (e.g. backup count) and decide whether it should stop the node
> > > > > > > > > > > > > > > > > > > > or not.
> > > > > > > > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov
> > > > > > > > > > > > > > > > > > > > <stkuzma@gmail.com> wrote:
> > > > > > > > > > > > > > > > > >
> > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > David, Yakov, I understand your fears. But liveness checks deal
> > > > > > > > > > > > > > > > > > > > > with _critical_ conditions, i.e. when such a condition is met we
> > > > > > > > > > > > > > > > > > > > > conclude the node is totally broken, and there is no sense to
> > > > > > > > > > > > > > > > > > > > > keep it alive regardless of the data it contains. If we want to
> > > > > > > > > > > > > > > > > > > > > give it a chance, then the condition (long fsync etc.) should
> > > > > > > > > > > > > > > > > > > > > not be considered as critical at all.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 15:18, Yakov Zhdanov
> > > > > > > > > > > > > > > > > > > > > <yzhdanov@apache.org> wrote:
> > > > > > > > > > > > > > > > > >
> > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Agree with David. We need to have an opportunity to set a
> > > > > > > > > > > > > > > > > > > > > > backups count threshold (at runtime also!) that will not allow
> > > > > > > > > > > > > > > > > > > > > > any automatic stop if there would be a data loss. Andrey, what
> > > > > > > > > > > > > > > > > > > > > > do you think?
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > --Yakov
> > > > > > > > > > > > > > > > > >
> > >
> > > > > > > > > > > > > > > > > >
> > >
> > > > > > > > > > > > > > > > > >
> > > --
> > > > > > > > > > > > > > > > > >
> > > Best regards,
> > > > > > > > > > > > > > > > > >
> > >   Andrey Kuznetsov.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
--
> > > > > > > > > > > > > > > > > >
--
> > > > > > > > > > > > > > > > > >
Maxim Muzafarov
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > >   Andrey Kuznetsov.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > >   Andrey Kuznetsov.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Best Regards, Vyacheslav D.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best Regards, Vyacheslav D.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > --
> > > > > > > > > Maxim Muzafarov
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > >
> > >
> >
> --
> --
> Maxim Muzafarov
>


-- 
Best regards,
  Andrey Kuznetsov.
