mesos-issues mailing list archives

From "Jie Yu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MESOS-9307) Libprocess should have a way to detect stuck actor.
Date Thu, 11 Oct 2018 05:09:00 GMT
Jie Yu created MESOS-9307:
-----------------------------

             Summary: Libprocess should have a way to detect stuck actor.
                 Key: MESOS-9307
                 URL: https://issues.apache.org/jira/browse/MESOS-9307
             Project: Mesos
          Issue Type: Improvement
          Components: libprocess
            Reporter: Jie Yu


We spent two days on a bug that turned out to be an infinite loop in an actor, which blocked
that actor from processing any other events.

Currently, the only way to find out that an actor is stuck is to attach gdb. We should think
about a way to print error logs when an actor has been stuck for longer than some threshold.
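One possible shape for such a check is shown below. It is a minimal sketch, not a description of libprocess internals. The idea is to record when each actor starts handling an event and have a periodic timer compare that start time against a threshold. The class StuckActorDetector, its begin()/end()/check() methods, and the use of plain strings as actor names are all hypothetical. libprocess itself identifies processes by UPID and runs its own dispatch loop.

```cpp
#include <cassert>
#include <chrono>
#include <iostream>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical bookkeeping, not libprocess's API. A worker thread calls
// begin() just before dispatching an event to an actor and end() right after
// the handler returns; an actor runs on at most one worker at a time.
class StuckActorDetector {
 public:
  using Clock = std::chrono::steady_clock;

  explicit StuckActorDetector(Clock::duration threshold)
      : threshold_(threshold) {
    assert(threshold_ > Clock::duration::zero());
  }

  void begin(const std::string& actor, Clock::time_point now) {
    std::lock_guard<std::mutex> lock(mutex_);
    running_[actor] = now;
  }

  void end(const std::string& actor) {
    std::lock_guard<std::mutex> lock(mutex_);
    running_.erase(actor);
  }

  // Meant to be called periodically, e.g. from a timer thread. Logs and
  // returns every actor whose current event has run longer than the threshold.
  std::vector<std::string> check(Clock::time_point now) {
    std::vector<std::string> stuck;
    std::lock_guard<std::mutex> lock(mutex_);
    for (const auto& [actor, start] : running_) {
      const auto elapsed = now - start;
      if (elapsed > threshold_) {
        std::cerr << "WARNING: actor " << actor
                  << " has been processing one event for "
                  << std::chrono::duration_cast<std::chrono::seconds>(elapsed).count()
                  << "s\n";
        stuck.push_back(actor);
      }
    }
    return stuck;
  }

 private:
  const Clock::duration threshold_;
  std::mutex mutex_;
  std::map<std::string, Clock::time_point> running_;
};
```

A worker would call begin(name, Clock::now()) and end(name) around each event, and a timer would call check(Clock::now()). This catches the failure described above, an actor spinning inside a single event handler. It does not catch an actor that is merely slow across many short events.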

For instance, the Linux kernel prints a warning to the kernel log if a task is stuck for more
than 120 seconds. Something like this would be extremely helpful.
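The kernel check mentioned above works, roughly, by comparing a blocked task's context-switch count across scans rather than by timing it. A hypothetical analogue for actors is sketched below. It assumes that each actor exposes a busy flag and a count of finished events; neither exists in libprocess as written here. The watchdog flags an actor that was busy at two consecutive scans without finishing an event in between.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-actor state (not libprocess's API). A worker thread sets
// `busy` before dispatching an event to the actor, increments
// `eventsProcessed` when the handler returns, then clears `busy`.
struct ActorProgress {
  std::atomic<std::uint64_t> eventsProcessed{0};
  std::atomic<bool> busy{false};
};

// Loosely modeled on the kernel's hung-task check: rather than timing each
// event, remember every actor's progress counter from the previous scan.
// scan() is meant to run once per timeout interval (e.g. every 120 s).
class HungActorWatchdog {
 public:
  void watch(const std::string& name, ActorProgress* actor) {
    assert(actor != nullptr);
    entries_[name] = Entry{actor, actor->eventsProcessed.load(), false, false};
  }

  // Returns actors that were busy at this scan and the previous one without
  // finishing an event in between, i.e. stuck in one handler for at least a
  // full interval. Each stuck episode is logged only once.
  std::vector<std::string> scan() {
    std::vector<std::string> hung;
    for (auto& [name, e] : entries_) {
      const std::uint64_t count = e.actor->eventsProcessed.load();
      const bool busy = e.actor->busy.load();
      if (busy && e.lastBusy && count == e.lastCount) {
        if (!e.reported) {
          std::cerr << "WARNING: actor " << name
                    << " made no progress for a full watchdog interval\n";
          e.reported = true;
        }
        hung.push_back(name);
      } else {
        e.reported = false;
      }
      e.lastCount = count;
      e.lastBusy = busy;
    }
    return hung;
  }

 private:
  struct Entry {
    ActorProgress* actor;
    std::uint64_t lastCount;
    bool lastBusy;
    bool reported;
  };
  std::map<std::string, Entry> entries_;
};
```

Because scan() only compares counters, the per-event cost to the workers is two atomic stores and one atomic increment, with no clock reads. The two atomics are read separately, so this is a best-effort heuristic. A scan that races with an actor finishing one event and starting the next can misreport that actor for one interval.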

Another option is to expose this information as metrics.
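The information could instead be exposed as metrics. One sketch uses hypothetical names and plain C++ rather than libprocess's own metrics library. It keeps the age of each actor's in-flight event and the longest completed event per actor. It then publishes both as a flat name-to-value map of the kind a metrics snapshot endpoint returns. Metric keys such as actors/<name>/current_event_secs are invented for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <map>
#include <mutex>
#include <string>

// Hypothetical per-actor event timing (not libprocess's metrics API).
class ActorEventMetrics {
 public:
  using Clock = std::chrono::steady_clock;

  void eventStarted(const std::string& actor, Clock::time_point now) {
    std::lock_guard<std::mutex> lock(mutex_);
    // An actor handles one event at a time, so it cannot already be in flight.
    assert(inFlight_.count(actor) == 0);
    inFlight_[actor] = now;
  }

  void eventFinished(const std::string& actor, Clock::time_point now) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = inFlight_.find(actor);
    if (it == inFlight_.end()) return;
    const double secs = toSeconds(now - it->second);
    maxEventSecs_[actor] = std::max(maxEventSecs_[actor], secs);
    inFlight_.erase(it);
  }

  // Flat name -> value map: the longest completed event per actor, plus how
  // long each actor's current event has been running so far.
  std::map<std::string, double> snapshot(Clock::time_point now) const {
    std::lock_guard<std::mutex> lock(mutex_);
    std::map<std::string, double> out;
    for (const auto& [actor, secs] : maxEventSecs_) {
      out["actors/" + actor + "/max_event_secs"] = secs;
    }
    for (const auto& [actor, start] : inFlight_) {
      out["actors/" + actor + "/current_event_secs"] = toSeconds(now - start);
    }
    return out;
  }

 private:
  static double toSeconds(Clock::duration d) {
    return std::chrono::duration<double>(d).count();
  }

  mutable std::mutex mutex_;
  std::map<std::string, Clock::time_point> inFlight_;
  std::map<std::string, double> maxEventSecs_;
};
```

A monitoring system could then alert when current_event_secs keeps growing for some actor, which moves the choice of threshold out of the process.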



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
