ignite-dev mailing list archives

From Andrey Kuznetsov <stku...@gmail.com>
Subject Critical worker threads liveness checking drawbacks
Date Thu, 06 Sep 2018 14:02:47 GMT

Currently, we have a nearly complete implementation of liveness checking
for system-critical threads [1], in terms of IEP-14 [2] and IEP-5 [3]. In a
nutshell, system-critical threads monitor each other and check two conditions:
- whether a thread is alive;
- whether a thread is active, i.e. it updates its heartbeat timestamp.
When either check fails, the critical failure handler is called, which in
practice means the node is stopped.
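
To make the two checks concrete, here is a minimal sketch in Java. It is an
illustration only, not the code from [1]: the names MonitoredWorker,
LivenessStatus, updateHeartbeat and check are hypothetical, and timestamps
are passed in explicitly to keep the example deterministic.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Outcome of one liveness check (hypothetical names). */
enum LivenessStatus { OK, TERMINATED, BLOCKED }

/** Hypothetical stand-in for a system-critical worker; not an Ignite class. */
class MonitoredWorker {
    private final Thread thread;
    private final AtomicLong heartbeatTs;

    MonitoredWorker(Thread thread, long nowMs) {
        this.thread = thread;
        this.heartbeatTs = new AtomicLong(nowMs);
    }

    /** The worker calls this on every iteration of its main loop. */
    void updateHeartbeat(long nowMs) {
        heartbeatTs.set(nowMs);
    }

    /** Run by a peer thread: the "alive" check first, then the "active" check. */
    LivenessStatus check(long nowMs, long timeoutMs) {
        if (!thread.isAlive())
            return LivenessStatus.TERMINATED;

        if (nowMs - heartbeatTs.get() > timeoutMs)
            return LivenessStatus.BLOCKED;

        return LivenessStatus.OK;
    }
}
```

In this sketch, any result other than OK is what would be passed to the
critical failure handler.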

The implementation of activity checks currently has a flaw: some blocking
actions are part of normal operation and should not lead to a node stop, e.g.
- WAL writer thread can call {{fsync()}};
- any cache write that occurs in the system striped executor can, in turn,
lead to an {{fsync()}} call.
The former example can be fixed by temporarily disabling heartbeat checks
for known long-running actions, but this won't work for the latter one.
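
The fix for the former case could look roughly like the sketch below. It is
an assumption about one possible shape, not the actual patch: the class
SuspendableHeartbeat, the method runBlocking and the sentinel value are all
hypothetical. The worker wraps a known long-running action, and the activity
check is skipped while that action is in progress.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical heartbeat holder that can be suspended; not an Ignite class. */
class SuspendableHeartbeat {
    /** Sentinel meaning "checks are temporarily disabled". */
    private static final long SUSPENDED = Long.MAX_VALUE;

    private final AtomicLong heartbeatTs;

    SuspendableHeartbeat(long nowMs) {
        heartbeatTs = new AtomicLong(nowMs);
    }

    void update(long nowMs) {
        heartbeatTs.set(nowMs);
    }

    /** Runs a known long-running action (e.g. an fsync) with checks disabled. */
    void runBlocking(Runnable action, LongSupplier clock) {
        heartbeatTs.set(SUSPENDED);

        try {
            action.run();
        }
        finally {
            // Re-enable checks with a fresh timestamp, even if the action failed.
            heartbeatTs.set(clock.getAsLong());
        }
    }

    /** The activity check as seen by a monitoring thread. */
    boolean isActive(long nowMs, long timeoutMs) {
        long ts = heartbeatTs.get();

        return ts == SUSPENDED || nowMs - ts <= timeoutMs;
    }
}
```

The WAL writer could wrap its {{fsync()}} call in runBlocking this way. The
sketch does nothing for the striped executor case.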

I see a few options to address the issue:
- Just log any long-running action instead of calling the critical failure
handler.
- Introduce several severity levels for handling long-running actions. Each
level will have its own failure handler. Depending on the level, a
long-running action can lead to a node stop, error logging, or no reaction.
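
As a rough illustration of the second option, here is a sketch with one
handler per severity level. All names (Severity, its constants,
LongRunningActionDispatcher) are hypothetical and only show the idea; they
are not an existing or proposed Ignite API.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical severity levels for long-running actions. */
enum Severity { IGNORE, LOG_ERROR, STOP_NODE }

/** Hypothetical dispatcher that routes an action to the handler of its level. */
class LongRunningActionDispatcher {
    private final Map<Severity, Consumer<String>> handlers = new EnumMap<>(Severity.class);

    /** Registers the failure handler for one severity level. */
    void register(Severity level, Consumer<String> handler) {
        handlers.put(level, handler);
    }

    /** Called by the monitoring thread when an action exceeds its timeout. */
    void onLongRunningAction(String action, Severity level) {
        Consumer<String> handler = handlers.get(level);

        // A level without a registered handler results in no reaction.
        if (handler != null)
            handler.accept(action);
    }
}
```

The first option then becomes the special case where every long-running
action is reported at the logging level.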

I encourage you to suggest other options. Any idea is appreciated.

[1] https://issues.apache.org/jira/browse/IGNITE-6587

Best regards,
  Andrey Kuznetsov.
