mesos-issues mailing list archives

From "Zhitao Li (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MESOS-8353) Duplicate task for same framework on multiple agents crashes out master after failover
Date Thu, 21 Dec 2017 04:25:00 GMT
Zhitao Li created MESOS-8353:
--------------------------------

             Summary: Duplicate task for same framework on multiple agents crashes out master after failover
                 Key: MESOS-8353
                 URL: https://issues.apache.org/jira/browse/MESOS-8353
             Project: Mesos
          Issue Type: Bug
            Reporter: Zhitao Li


We have seen a Mesos master go into a crash loop after a leader failover. After more investigation,
it seems that a task with the same task ID was created on multiple Mesos agents in the cluster.


One possible sequence of events that can lead to this problem:

1. Task T1 was launched through master M1 on agent A1 for framework F;
2. Master M1 failed over to M2;
3. Before A1 reregistered with M2, the same T1 was launched on agent A2: M2 did not yet know
about the previous T1, so it accepted the launch and sent the task to A2;
4. A1 reregistered: this probably crashed M2 (because the same task cannot be added twice);
5. When M3 tried to come up after M2, it crashed as well, because both A1 and A2 tried to add
a T1 to the framework.

(Right now I only have logs that confirm the last step.)
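
As a rough illustration of the sequence above, here is a toy model in Python. It is not actual Mesos master code: ToyMaster, DuplicateTaskError and replay are hypothetical names, and the exception only stands in for whatever fatal check aborts the real master when the same task is added twice.

```python
class DuplicateTaskError(Exception):
    """Stands in for the fatal failure that aborts the master (assumption)."""


class ToyMaster:
    """Hypothetical, heavily simplified model of a master's task bookkeeping."""

    def __init__(self, name):
        self.name = name
        # (framework_id, task_id) -> agent_id; empty after every failover,
        # because task state is rebuilt only from agent reregistration.
        self.tasks = {}

    def add_task(self, framework_id, task_id, agent_id):
        key = (framework_id, task_id)
        if key in self.tasks:
            raise DuplicateTaskError(
                f"{self.name}: task {task_id} of framework {framework_id} "
                f"already known on {self.tasks[key]}, also reported by {agent_id}"
            )
        self.tasks[key] = agent_id

    def launch(self, framework_id, task_id, agent_id):
        # A launch is accepted whenever the master does not know the task ID.
        self.add_task(framework_id, task_id, agent_id)

    def reregister_agent(self, agent_id, running_tasks):
        # The agent reports the tasks it is running; the master re-adds them.
        for framework_id, task_id in running_tasks:
            self.add_task(framework_id, task_id, agent_id)


def replay():
    """Replays steps 1-5 and records which masters would have crashed."""
    crashed = []

    m1 = ToyMaster("M1")
    m1.launch("F", "T1", "A1")              # step 1

    m2 = ToyMaster("M2")                    # step 2: failover, empty state
    m2.launch("F", "T1", "A2")              # step 3: duplicate T1 accepted
    try:
        m2.reregister_agent("A1", [("F", "T1")])   # step 4
    except DuplicateTaskError:
        crashed.append("M2")

    m3 = ToyMaster("M3")                    # step 5: both agents report T1
    m3.reregister_agent("A2", [("F", "T1")])
    try:
        m3.reregister_agent("A1", [("F", "T1")])
    except DuplicateTaskError:
        crashed.append("M3")

    return crashed
```

In this model, every master elected afterwards hits the same failure as soon as the second agent reregisters, which would match the crash loop we observed.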

This happened on 1.4.0 masters.

Although this was probably triggered by incorrect retry logic on the framework side, I wonder whether
the Mesos master should add extra protection to prevent this from happening. One possible idea is
to instruct one of the agents carrying tasks with a duplicate ID to terminate the corresponding tasks,
or to simply refuse to reregister such agents and instruct them to shut down.
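
A minimal sketch of what such protection could look like, again as a toy Python model rather than real Mesos code. The function reregister_agent, the ReregisterResult type and the two policy names are hypothetical, chosen only to mirror the two ideas above.

```python
from dataclasses import dataclass, field


@dataclass
class ReregisterResult:
    accepted: bool                                      # False: agent told to shut down
    tasks_to_kill: list = field(default_factory=list)   # duplicates the agent should kill


def reregister_agent(known_tasks, agent_id, reported_tasks, policy="kill_tasks"):
    """Handles reregistration without aborting on duplicate task IDs.

    known_tasks maps (framework_id, task_id) -> agent_id and is updated in place.
    policy is either "kill_tasks" or "shutdown_agent" (both hypothetical).
    """
    # A reported task is a duplicate if another agent already owns that ID.
    duplicates = [
        key for key in reported_tasks
        if key in known_tasks and known_tasks[key] != agent_id
    ]

    if duplicates and policy == "shutdown_agent":
        # Refuse the whole agent and leave the master's state untouched.
        return ReregisterResult(accepted=False)

    # Otherwise accept the agent, record its non-duplicate tasks, and ask it
    # to terminate the tasks whose IDs are already owned elsewhere.
    for key in reported_tasks:
        if key not in duplicates:
            known_tasks[key] = agent_id
    return ReregisterResult(accepted=True, tasks_to_kill=duplicates)
```

For example, with T1 already known on A2, a reregistering A1 that reports T1 and T2 would keep T2 and be asked to kill T1 under the "kill_tasks" policy, or be rejected entirely under "shutdown_agent". Which copy of the task should survive is a separate question that this sketch does not try to answer; it simply keeps the one the master saw first.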



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
