ignite-dev mailing list archives

From Denis Magda <dma...@apache.org>
Subject Re: Critical worker threads liveness checking drawbacks
Date Mon, 24 Sep 2018 14:45:54 GMT
Andrey K. and G.,

Thanks, do we have a documentation ticket created? Prachi (copied) can help
with the documentation.

--
Denis

On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <agura@apache.org> wrote:

> Andrey,
>
> Your change is finally merged to the master branch. Congratulations, and
> thank you very much! :)
>
> I think the next step is a feature that will allow signaling about
> blocked threads to monitoring tools via an MXBean.
>
> I hope you will continue developing this feature and provide your
> vision in a new JIRA issue.
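>
> (Not the final API, but a minimal sketch of what such an MXBean could
> expose; the interface and method names below are hypothetical:)
>
>     import java.util.List;
>
>     /** JMX view over the workers registry for monitoring tools. */
>     public interface WorkersControlMXBean {
>         /** Names of workers that missed their liveness deadline. */
>         List<String> getBlockedWorkerNames();
>
>         /** Last heartbeat timestamp of the given worker, in milliseconds. */
>         long getWorkerHeartbeat(String workerName);
>     }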
>
>
> On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <stkuzma@gmail.com>
> wrote:
> >
> > David, Maxim!
> >
> > Thanks a lot for your ideas. Unfortunately, I can't adopt all of them
> > right now: the scope is much broader than the scope of the change I am
> > implementing. I have talked to a group of Ignite committers, and we
> > agreed to complete the change as follows.
> > - Blocking instructions in system-critical threads which may reasonably
> > last long should be explicitly excluded from the monitoring.
> > - Failure handlers should have a setting to suppress some failures on a
> > per-failure-type basis.
> > Accordingly, I have updated the implementation: [1] (a rough sketch of
> > both points follows the link).
> >
> > [1] https://github.com/apache/ignite/pull/4089
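> >
> > (To illustrate both points, a hedged sketch; the names below follow the
> > direction of the change but may differ from the final API:)
> >
> >     // 1. Inside a system-critical worker: exclude a blocking call that
> >     // may legitimately take long from liveness monitoring.
> >     blockingSectionBegin();
> >     try {
> >         sock.read(buf); // e.g. a potentially long socket read
> >     }
> >     finally {
> >         blockingSectionEnd();
> >     }
> >
> >     // 2. Suppress node stop for a particular failure type
> >     // (cfg is the node's IgniteConfiguration, EnumSet is java.util.EnumSet).
> >     StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler();
> >     hnd.setIgnoredFailureTypes(EnumSet.of(FailureType.SYSTEM_WORKER_BLOCKED));
> >     cfg.setFailureHandler(hnd);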
> >
> > Mon, Sep 10, 2018 at 22:35, David Harvey <syssoftsol@gmail.com>:
> >
> > > When I've done this before, I've needed to find the oldest thread and
> > > kill the node running it. From a terminology standpoint, Maxim's
> > > "without progress" is better than "heartbeat". For example, what I'm
> > > most interested in on a distributed system is which thread started the
> > > work it has not completed the earliest, and when that thread last made
> > > forward progress. You don't want to kill a node because a thread is
> > > waiting on a lock held by a thread that went off-node and has not
> > > gotten a response. If you don't understand the dependency
> > > relationships, you will make incorrect recovery decisions.
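> > >
> > > (To make that bookkeeping concrete, a minimal sketch in Java; the
> > > names are made up for illustration, not an existing Ignite API:)
> > >
> > >     // Per worker: when its current unit of work started, and when it
> > >     // last made observable forward progress.
> > >     final class ProgressRecord {
> > >         volatile long workStartedTs;  // 0 while the worker is idle
> > >         volatile long lastProgressTs;
> > >
> > >         void startWork()  { workStartedTs = lastProgressTs = System.currentTimeMillis(); }
> > >         void onProgress() { lastProgressTs = System.currentTimeMillis(); }
> > >         void finishWork() { workStartedTs = 0; }
> > >     }
> > >
> > >     // The suspect is the worker whose incomplete work started earliest
> > >     // (records is an assumed Map<String, ProgressRecord>); it becomes
> > >     // a kill candidate only if its lastProgressTs is also stale.
> > >     ProgressRecord oldest = records.values().stream()
> > >         .filter(r -> r.workStartedTs != 0)
> > >         .min(java.util.Comparator.comparingLong(r -> r.workStartedTs))
> > >         .orElse(null);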
> > >
> > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov <maxmuzaf@gmail.com>
> > > wrote:
> > >
> > > > I think we should find exact answers to these questions:
> > > >  1. What exactly is a `critical` issue?
> > > >  2. How can we find critical issues?
> > > >  3. How can we handle critical issues?
> > > >
> > > > First,
> > > >  - Ignore uninterruptible actions (e.g. worker/service shutdown)
> > > >  - Long I/O operations (there should be a configurable timeout for
> > > > each type of usage)
> > > >  - Infinite loops
> > > >  - Stalled/deadlocked threads (and/or too many parked threads,
> > > > excluding I/O)
> > > >
> > > > Second,
> > > >  - The working queue makes no progress (e.g. disco, exchange queues)
> > > >  - Work hasn't been completed since the last heartbeat (checking
> > > > milestones)
> > > >  - Too many system resources used by a thread over a long period of
> > > > time (allocated memory, CPU)
> > > >  - Timing fields associated with each thread status exceed a maximum
> > > > time limit.
> > > >
> > > > Third (not too many options here),
> > > >  - `log everything` should be the default behaviour in all these
> > > > cases, since it may be difficult to find the cause after a restart.
> > > >  - Wait some interval of time and kill the hanging node (the cluster
> > > > should be configured to be stable enough).
> > > >
> > > > Questions,
> > > >  - Not sure, but can workers miss their heartbeat deadlines when the
> > > > CPU is loaded up to 80-90%? Bursts of momentary overload can be
> > > > expected behaviour as a normal part of system operation.
> > > >  - Why have we decided that critical threads should monitor each
> > > > other? For instance, if all the tasks were blocked and unable to run,
> > > > a node reset would never occur. As for me, a better solution is to
> > > > use a separate monitor thread or pool (maybe both, with software and
> > > > hardware checks) that not only checks heartbeats but monitors the
> > > > rest of the system as well (a rough sketch follows).
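> > > >
> > > > (A minimal sketch of such a dedicated monitor, assuming a
> > > > hypothetical heartbeat map and callback; none of this is an existing
> > > > Ignite API:)
> > > >
> > > >     import java.util.Map;
> > > >     import java.util.concurrent.*;
> > > >     import java.util.function.Consumer;
> > > >
> > > >     // A single scheduled thread checks all workers' heartbeats, so a
> > > >     // blocked worker is detected even if every worker is blocked.
> > > >     final class Watchdog {
> > > >         private final ScheduledExecutorService pool =
> > > >             Executors.newSingleThreadScheduledExecutor();
> > > >
> > > >         /** heartbeats: worker name -> last heartbeat timestamp, ms. */
> > > >         void start(ConcurrentMap<String, Long> heartbeats,
> > > >             long maxInactivityMs, Consumer<String> onBlocked) {
> > > >             pool.scheduleAtFixedRate(() -> {
> > > >                 long now = System.currentTimeMillis();
> > > >                 for (Map.Entry<String, Long> e : heartbeats.entrySet()) {
> > > >                     if (now - e.getValue() > maxInactivityMs)
> > > >                         onBlocked.accept(e.getKey()); // alert or fail the node
> > > >                 }
> > > >             }, maxInactivityMs, maxInactivityMs / 2, TimeUnit.MILLISECONDS);
> > > >         }
> > > >     }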
> > > >
> > > > On Mon, 10 Sep 2018 at 00:07 David Harvey <syssoftsol@gmail.com> wrote:
> > > >
> > > > > It would be safer to restart the entire cluster than to remove the
> last
> > > > > node for a cache that should be redundant.
> > > > >
> > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <agura@apache.org> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I agree with Yakov that we can provide some option that manages
> > > > > > the worker liveness checker's behavior when it observes that some
> > > > > > worker has been blocked for too long. At least it will be some
> > > > > > workaround for cases when failing the node is too annoying.
> > > > > >
> > > > > > A backups count threshold sounds good, but I don't understand how
> > > > > > it will help in case the cluster hangs.
> > > > > >
> > > > > > The simplest solution here is to alert when some critical worker
> > > > > > is blocked (we can improve WorkersRegistry for this purpose and
> > > > > > expose a list of blocked workers) and optionally call a
> > > > > > system-configured failure processor. BTW, the failure processor
> > > > > > can be extended in order to perform any checks (e.g. backup
> > > > > > count) and decide whether it should stop the node or not.
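> > > > > >
> > > > > > (A rough sketch of such an extended failure handler; it assumes
> > > > > > the failure handling API, and remainingBackups() is a made-up
> > > > > > placeholder, not a real Ignite call:)
> > > > > >
> > > > > >     import org.apache.ignite.Ignite;
> > > > > >     import org.apache.ignite.failure.*;
> > > > > >
> > > > > >     // Refuses to stop the node when doing so could lose the last
> > > > > >     // copy of some data; otherwise delegates to a normal node stop.
> > > > > >     public class BackupAwareFailureHandler extends AbstractFailureHandler {
> > > > > >         @Override protected boolean handle(Ignite ignite, FailureContext ctx) {
> > > > > >             if (remainingBackups(ignite) == 0) {
> > > > > >                 ignite.log().error("Critical failure, but stopping the " +
> > > > > >                     "node would cause data loss: " + ctx);
> > > > > >                 return false; // keep the node alive, only alert
> > > > > >             }
> > > > > >             return new StopNodeFailureHandler().onFailure(ignite, ctx);
> > > > > >         }
> > > > > >
> > > > > >         private int remainingBackups(Ignite ignite) {
> > > > > >             return 1; // placeholder: compute alive backup copies here
> > > > > >         }
> > > > > >     }
> > > > > >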
> > > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <stkuzma@gmail.com> wrote:
> > > > > > >
> > > > > > > David, Yakov, I understand your fears. But liveness checks deal
> > > > > > > with _critical_ conditions, i.e. when such a condition is met,
> > > > > > > we consider the node totally broken, and there is no sense in
> > > > > > > keeping it alive regardless of the data it contains. If we want
> > > > > > > to give it a chance, then the condition (long fsync, etc.)
> > > > > > > should not be considered critical at all.
> > > > > > >
> > > > > > > Sat, Sep 8, 2018 at 15:18, Yakov Zhdanov <yzhdanov@apache.org>:
> > > > > > >
> > > > > > > > Agree with David. We need to have an opportunity to set a
> > > > > > > > backups count threshold (at runtime also!) that will not
> > > > > > > > allow any automatic stop if there would be data loss. Andrey,
> > > > > > > > what do you think?
> > > > > > > >
> > > > > > > > --Yakov
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > >   Andrey Kuznetsov.
> > > > > >
> > > > >
> > > > --
> > > > Maxim Muzafarov
> > > >
> > >
> >
> >
> > --
> > Best regards,
> >   Andrey Kuznetsov.
>
