ignite-dev mailing list archives

From Yakov Zhdanov <yzhda...@apache.org>
Subject Re: Critical worker threads liveness checking drawbacks
Date Fri, 07 Sep 2018 15:44:24 GMT

I don't understand your point. In my opinion, the idea of these changes is
to make the cluster more stable and responsive by eliminating hung nodes. I
would not draw much of a distinction between threads trapped in a deadlock
and threads hanging on fsync calls for too long. Both situations lead to
increasing latency in the cluster, up to its full unavailability.

So killing a node that hangs on fsync may be reasonable. Agree?

You may implement an approach where warning messages are logged by
default, but a termination option should also be available.
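
A minimal sketch of the approach suggested above: a stale heartbeat is
logged as a warning by default, and node termination is an opt-in setting.
All class, enum and method names here are hypothetical and do not come from
the Ignite code base; time is passed in explicitly to keep the example
deterministic.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names are part of the Ignite API. */
public class BlockedWorkerPolicySketch {
    /** Reaction to a worker whose heartbeat is older than the timeout. */
    public enum Action { WARN, TERMINATE }

    private final long timeoutMs;
    private final Action action;
    private final List<String> warnings = new ArrayList<>();
    private boolean stopRequested;

    public BlockedWorkerPolicySketch(long timeoutMs, Action action) {
        this.timeoutMs = timeoutMs;
        this.action = action;
    }

    /** Default policy: only warn, never stop the node. */
    public BlockedWorkerPolicySketch(long timeoutMs) {
        this(timeoutMs, Action.WARN);
    }

    /** Checks one worker; a stale heartbeat is always logged. */
    public void check(String worker, long lastHeartbeatMs, long nowMs) {
        long idleMs = nowMs - lastHeartbeatMs;

        if (idleMs <= timeoutMs)
            return; // Heartbeat is fresh enough, nothing to do.

        warnings.add("Worker '" + worker + "' blocked for " + idleMs + " ms");

        // Stand-in for invoking a real failure handler (assumption).
        if (action == Action.TERMINATE)
            stopRequested = true;
    }

    public boolean stopRequested() {
        return stopRequested;
    }

    public List<String> warnings() {
        return warnings;
    }
}
```

With the single-argument constructor a slow fsync only produces a log
entry; passing Action.TERMINATE turns the same check into a node stop
request.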



2018-09-06 17:02 GMT+03:00 Andrey Kuznetsov <stkuzma@gmail.com>:

> Igniters,
> Currently, we have a nearly completed implementation for system-critical
> threads liveness checking [1], in terms of IEP-14 [2] and IEP-5 [3]. In a
> nutshell, system-critical threads monitor each other and check two
> aspects:
> - whether a thread is alive;
> - whether a thread is active, i.e. it updates its heartbeat timestamp
> periodically.
> When either check fails, the critical failure handler is called, which in
> effect means the node stops.
> The implementation of activity checks currently has a flaw: some blocking
> actions are part of normal operation and should not lead to node stop, e.g.
> - the WAL writer thread can call {{fsync()}};
> - any cache write that occurs in the system striped executor can also lead
> to an {{fsync()}} call.
> The former example can be fixed by disabling heartbeat checks temporarily
> for known long-running actions, but that won't work for the latter one.
> I see a few options to address the issue:
> - Just log any long-running action instead of calling critical failure
> handler.
> - Introduce several severity levels for handling long-running actions. Each
> level will have its own failure handler. Depending on the level, a
> long-running action can lead to node stop, error logging, or no reaction.
> I encourage you to suggest other options. Any idea is appreciated.
> [1] https://issues.apache.org/jira/browse/IGNITE-6587
> [2]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
> [3]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=74683878
> --
> Best regards,
>   Andrey Kuznetsov.
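
The heartbeat check described in the quoted message, together with the
idea of disabling it temporarily around a known long-running action, can
be sketched roughly as follows. This is an illustration under assumptions,
not the implementation from IGNITE-6587: the class and method names are
hypothetical, and timestamps are passed in explicitly instead of being
read from the system clock.

```java
/** Hypothetical sketch; not the implementation referenced in the thread. */
public class HeartbeatSketch {
    private volatile long lastHeartbeatMs;
    private volatile boolean checkSuspended;

    public HeartbeatSketch(long nowMs) {
        lastHeartbeatMs = nowMs;
    }

    /** Called periodically by the monitored worker thread. */
    public void updateHeartbeat(long nowMs) {
        lastHeartbeatMs = nowMs;
    }

    /** Called before a known blocking action, e.g. an fsync call. */
    public void beginBlockingSection() {
        checkSuspended = true;
    }

    /** Called after the blocking action; also refreshes the heartbeat. */
    public void endBlockingSection(long nowMs) {
        lastHeartbeatMs = nowMs;
        checkSuspended = false;
    }

    /** Called by a monitoring thread; true means the worker looks hung. */
    public boolean isStale(long nowMs, long timeoutMs) {
        return !checkSuspended && nowMs - lastHeartbeatMs > timeoutMs;
    }
}
```

As the message notes, this only helps where the blocking call site is
known in advance, as with the WAL writer. It does not cover cache writes
in the striped executor.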
