ignite-dev mailing list archives

From Denis Magda <dma...@apache.org>
Subject Automatic Handling of Long Stop-the-World Pauses
Date Thu, 21 Jun 2018 22:52:36 GMT

It's a pleasure to see how our project is evolving in the direction of being
a self-healing solution:

   - Ignite can already handle critical failures such as OOM, File I/O
   issues, etc. [1]
   - There is an endeavor to fix cluster lock-ins due to partition map
   exchange issues. [2]

There is one more notorious problem that might affect Ignite deployments:
long stop-the-world GC pauses.

I know we made some progress in this direction [3] by providing
metrics that help to monitor the pauses. Why don't we keep up the
pace and teach Ignite to help itself when it sees a node that is dragging
down overall cluster performance due to a long stop-the-world pause (STP)?
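
The idea can be sketched roughly as follows. This is only an illustration
under stated assumptions, not Ignite's actual implementation: the class name
StwPauseDetector, its parameters, and the onLongPause hook are all
hypothetical. A watchdog thread sleeps for a fixed interval and treats any
extra delay in waking up as a probable stop-the-world pause; once that delay
reaches a threshold, it calls a handler where a failure policy could be
plugged in.

```java
import java.util.function.LongConsumer;

/** Hypothetical sketch of a stop-the-world pause watchdog; not an Ignite API. */
final class StwPauseDetector {
    private final long intervalMs;          // sleep time between two checks
    private final long thresholdMs;         // extra delay that counts as a long pause
    private final LongConsumer onLongPause; // assumed policy hook, receives the pause in ms
    private long lastTickMs;                // timestamp of the previous check

    StwPauseDetector(long intervalMs, long thresholdMs, long startMs, LongConsumer onLongPause) {
        this.intervalMs = intervalMs;
        this.thresholdMs = thresholdMs;
        this.lastTickMs = startMs;
        this.onLongPause = onLongPause;
    }

    /**
     * Records one wake-up. The estimated pause is the time since the previous
     * wake-up minus the intended sleep interval. Returns the estimate in ms.
     */
    long tick(long nowMs) {
        long pauseMs = Math.max(0, nowMs - lastTickMs - intervalMs);
        lastTickMs = nowMs;
        if (pauseMs >= thresholdMs)
            onLongPause.accept(pauseMs);
        return pauseMs;
    }

    /** Starts a daemon thread that calls tick() after every sleep. */
    Thread start() {
        lastTickMs = nowMs();
        Thread t = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    Thread.sleep(intervalMs);
                    // A late wake-up may also be caused by OS scheduling,
                    // so the estimate is an upper bound on the real pause.
                    tick(nowMs());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "stw-pause-detector");
        t.setDaemon(true);
        t.start();
        return t;
    }

    private static long nowMs() {
        return System.nanoTime() / 1_000_000L;
    }
}
```

A node could, for example, create new StwPauseDetector(50, 5000, 0, handler)
and call start(); what the handler does (log, stop the node, something else)
is exactly the policy question raised here.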

I would create policies similar to the critical failures policies [4] or
just add a long STP to the list of critical failures and reuse existing

Thoughts? Anyone who'd like to implement the feature?

[1] https://apacheignite.readme.io/docs/critical-failures-handling
[3] https://issues.apache.org/jira/browse/IGNITE-6171
