mesos-issues mailing list archives

From "Vinod Kone (JIRA)" <>
Subject [jira] [Commented] (MESOS-8185) Tasks can be known to the agent but unknown to the master.
Date Wed, 15 Nov 2017 23:34:00 GMT


Vinod Kone commented on MESOS-8185:

cc [~xujyan] [~megha.sharma] Is this still an issue after your changes?

> Tasks can be known to the agent but unknown to the master.
> ----------------------------------------------------------
>                 Key: MESOS-8185
>                 URL:
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.2.0
>            Reporter: Ilya Pronin
>            Assignee: Ilya Pronin
>              Labels: reliability
> Currently, when a master re-registers an agent that was marked unreachable, it shuts down
all non-partition-aware frameworks on that agent. When a master re-registers an agent that
is already registered, it doesn't check that all tasks from the agent's re-registration message
are known to it.
> It is possible that, due to a transient loss of connectivity, an agent misses {{SlaveReregisteredMessage}}
along with {{ShutdownFrameworkMessage}} and thus does not kill the tasks of non-partition-aware
frameworks. The master, however, marks the agent as registered and does not re-add the tasks that
it expected to be killed. The agent may then re-register again, this time successfully, before being
marked unreachable, without ever having terminated the tasks of non-partition-aware frameworks. The
master simply forgets those tasks ever existed, because it "removed" them during the previous
re-registration.
> Example scenario:
> # Connection from the master to the agent stops working
> # Agent doesn't see pings from the master and attempts to re-register
> # Master sends {{SlaveReregisteredMessage}} and {{ShutdownFrameworkMessage}}, which don't get
to the agent because of the connection failure. Agent is marked registered.
> # The network issue resolves and the connection breaks. Agent retries re-registration.
> # Master thinks that the agent has been registered since step (3) and just re-sends {{SlaveReregisteredMessage}}.
Tasks remain running on the agent.
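
The scenario above can be sketched as a toy model. This is not Mesos code: the `Agent` and `Master` classes, the `deliver` flag, and the task IDs are hypothetical, and every task is assumed to belong to a non-partition-aware framework.

```python
# Toy model of the example scenario; all names here are hypothetical.

class Agent:
    def __init__(self, tasks):
        self.tasks = set(tasks)  # tasks actually running on the agent

    def handle_shutdown_framework(self):
        # Assumption: every task belongs to a non-partition-aware framework.
        self.tasks.clear()


class Master:
    def __init__(self):
        self.registered = False  # agent starts out marked unreachable
        self.tasks = set()       # tasks the master believes exist

    def reregister(self, agent, deliver):
        """Handle one re-registration attempt.

        `deliver` models whether master-to-agent messages arrive.
        """
        if not self.registered:
            # Previously unreachable agent: mark it registered and drop
            # the tasks that the shutdown message is expected to kill.
            self.registered = True
            self.tasks = set()
            if deliver:
                agent.handle_shutdown_framework()
        # Already registered: only the acknowledgement is re-sent. The
        # tasks the agent reports are not compared with the master's view.


agent = Agent({"task-1", "task-2"})
master = Master()

master.reregister(agent, deliver=False)  # step 3: messages are lost
master.reregister(agent, deliver=True)   # step 5: acknowledgement only

orphaned = agent.tasks - master.tasks
print(sorted(orphaned))  # ['task-1', 'task-2']
```

In this model the agent still runs both tasks after the second attempt, while the master has no record of either.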
> One possible solution would be to compare the list of tasks that the already registered
agent reports in {{ReregisterSlaveMessage}} with the list of tasks the master has. In this
case, anything that the master doesn't know about should not exist on the agent.
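
A minimal sketch of that comparison, assuming task IDs are plain strings. The function name `unknown_to_master` and the sample IDs are hypothetical and are not part of the Mesos API; what the master would then do with the result (for example, ask the agent to kill those tasks) is left open here.

```python
def unknown_to_master(reported_task_ids, master_task_ids):
    """Return the task IDs the agent reports that the master has no record of."""
    return set(reported_task_ids) - set(master_task_ids)


# Hypothetical data: the agent reports three tasks, the master knows one.
reported = ["task-1", "task-2", "task-3"]
known = ["task-3"]

stale = unknown_to_master(reported, known)
print(sorted(stale))  # ['task-1', 'task-2']
```

A set difference keeps the check linear in the number of tasks, which matters if it runs on every re-registration of an already registered agent.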

This message was sent by Atlassian JIRA
