mesos-issues mailing list archives

From "Alexander Rukletsov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-5396) After failover, master does not remove agents with same UPID.
Date Wed, 02 Nov 2016 11:48:58 GMT

     [ https://issues.apache.org/jira/browse/MESOS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Rukletsov updated MESOS-5396:
---------------------------------------
    Summary: After failover, master does not remove agents with same UPID.  (was: After failover, master does not remove agents with same UPID)

> After failover, master does not remove agents with same UPID.
> -------------------------------------------------------------
>
>                 Key: MESOS-5396
>                 URL: https://issues.apache.org/jira/browse/MESOS-5396
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Neil Conway
>            Assignee: Neil Conway
>            Priority: Critical
>              Labels: mesosphere
>
> Scenario:
> * master fails over
> * an agent host is restarted; the agent attempts to *register* (not reregister) with Mesos using the same UPID as the previous agent instance; this means it will get a new agent ID
> * framework isn't notified about the status of the tasks on the *old* agent ID until the {{agent_reregister_timeout}} expires (10 mins)
>
> This isn't necessarily wrong, but it is suboptimal: when the agent attempts to register with the same UPID that was used by the previous agent instance, we know that a *reregistration* attempt for the old <UPID, agent ID> pair will never be seen. Hence we can declare the old agent ID to be gone forever and notify frameworks appropriately, without waiting for the full {{agent_reregister_timeout}} to expire.
>
> Note that we already implement the proposed behavior for the case when the master does *not* fail over (https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).
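
The proposed behavior can be illustrated with a small model. This is only a sketch in Python, not Mesos code: the class `MasterModel`, its fields, and the `agent-N` ID scheme are all hypothetical names chosen for illustration, and the UPID strings are made-up examples. It assumes the master keeps, after failover, a map of recovered agent IDs to the UPID each one last used.

```python
from dataclasses import dataclass, field


@dataclass
class MasterModel:
    """Hypothetical toy model of the master's agent bookkeeping after failover."""

    # Agents recovered from the registry that have not reregistered yet:
    # agent ID -> UPID the agent last used.
    recovered: dict = field(default_factory=dict)
    # Agents currently registered: agent ID -> UPID.
    registered: dict = field(default_factory=dict)
    # Agent IDs that frameworks have been told are gone.
    lost_notifications: list = field(default_factory=list)
    next_id: int = 0

    def register(self, upid: str) -> str:
        """Handle a fresh registration (not a reregistration) from `upid`."""
        # A recovered agent that used this UPID can never reregister,
        # because the process at that UPID is now asking for a new ID.
        stale = [aid for aid, old in self.recovered.items() if old == upid]
        for aid in stale:
            del self.recovered[aid]
            # Notify right away instead of waiting for the timeout.
            self.lost_notifications.append(aid)

        new_id = f"agent-{self.next_id}"
        self.next_id += 1
        self.registered[new_id] = upid
        return new_id
```

In this model, a registration from a UPID that matches a recovered entry removes that entry and records a notification immediately, while recovered agents with other UPIDs are left to reregister or time out as before.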



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
