mesos-issues mailing list archives

From "Joseph Wu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MESOS-6286) Master does not remove an agent if it is responsive but not registered
Date Thu, 29 Sep 2016 20:24:20 GMT
Joseph Wu created MESOS-6286:
--------------------------------

             Summary: Master does not remove an agent if it is responsive but not registered
                 Key: MESOS-6286
                 URL: https://issues.apache.org/jira/browse/MESOS-6286
             Project: Mesos
          Issue Type: Bug
            Reporter: Joseph Wu
            Assignee: Neil Conway


As part of MESOS-6285, we observed an agent stuck in the recovery phase.  The agent would
do the following in a loop:
1) Systemd starts the agent.
2) The agent detects the master, but does not connect yet.  The agent needs to recover first.
3) The agent is responsive to {{PingSlaveMessage}}s from the master, but is stalled in recovery.
4) The agent is OOM-killed by the kernel before recovery finishes.  Repeat (1-4).

The consequences of this:
* Frameworks will never get a TASK_LOST or terminal status update for tasks on this agent.
* Executors on the agent can connect to the agent, but will not be able to register.

We should consider adding a timeout or other intervention in the master for agents that are
responsive but never finish recovery, and therefore never register.
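
To make the timeout idea concrete, here is a minimal sketch in Python rather than Mesos C++.
It is not Mesos code. The class {{UnregisteredAgentTracker}}, its methods, the 600 second
default, and the string agent identifiers are all assumptions made for illustration; the
issue proposes no value or design, and how the master would identify such an agent is left
open. The one point it tries to capture is that the clock starts at the first ping response
seen from an unregistered agent and is not reset when the agent restarts, so the crash loop
above cannot keep deferring removal.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UnregisteredAgentTracker:
    """Hypothetical master-side bookkeeping; not part of Mesos."""

    # Assumed timeout; the issue does not propose a value.
    registration_timeout_secs: float = 600.0
    # Agent id -> time of the first ping response seen while unregistered.
    first_seen: Dict[str, float] = field(default_factory=dict)

    def on_pong(self, agent_id: str, registered: bool, now: float) -> None:
        """Record a ping response from an agent at time `now` (seconds)."""
        if registered:
            # The agent finished recovery and registered: stop tracking it.
            self.first_seen.pop(agent_id, None)
        else:
            # Keep the earliest timestamp so agent restarts do not reset the clock.
            self.first_seen.setdefault(agent_id, now)

    def overdue(self, now: float) -> List[str]:
        """Return ids of agents that have stayed unregistered past the timeout."""
        return sorted(
            agent_id
            for agent_id, since in self.first_seen.items()
            if now - since >= self.registration_timeout_secs
        )
```

In this sketch the master would call {{on_pong}} for each ping response and periodically
call {{overdue}}; for each returned agent it could then take the intervention discussed
above, such as removing the agent so that frameworks receive TASK_LOST.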



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
