mesos-dev mailing list archives

From "Vinod Kone (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MESOS-1388) Inconsistent terminal task state between master and re-registering slave
Date Tue, 20 May 2014 01:03:59 GMT
Vinod Kone created MESOS-1388:
---------------------------------

             Summary: Inconsistent terminal task state between master and re-registering slave
                 Key: MESOS-1388
                 URL: https://issues.apache.org/jira/browse/MESOS-1388
             Project: Mesos
          Issue Type: Bug
            Reporter: Vinod Kone


The following sequence of events could result in the master sending TASK_LOST and then
TASK_FINISHED for the same task to a framework.

--> Master failed over
--> Slave tries to re-register with Master w/ a running task (T)
--> Master starts re-admission into the registry
--> Task finishes and slave removes it from its map
--> The TASK_FINISHED status update is dropped by master as re-admission is in progress
--> Slave retries re-registration (w/o task T) as master is still busy re-admitting it
and hasn't ACKed the re-registration yet
--> Master finally finishes re-admission and re-adds slave with task T
--> Master gets a duplicate/enqueued re-registration request (w/o task T) that results
in the master sending TASK_LOST during reconciliation.
--> Master now gets retried TASK_FINISHED update from the slave which it forwards to the
scheduler.
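
The sequence above can be replayed with a small toy model. This is only a sketch of the race under simplifying assumptions: `ToyMaster`, its methods, and `replay` are hypothetical names for illustration and are not Mesos APIs, and the "enqueued" retry is modeled by delivering it after re-admission completes.

```python
class ToyMaster:
    """Hypothetical stand-in for the master; not the real Mesos implementation."""

    def __init__(self):
        self.pending = {}     # slave -> tasks captured when re-admission started
        self.registered = {}  # slave -> tasks the master believes are running
        self.sent = []        # (task, state) updates sent to the framework, in order

    def reregister(self, slave, tasks):
        if slave in self.registered:
            # Slave already re-added: reconcile and declare missing tasks lost.
            for task in sorted(self.registered[slave] - set(tasks)):
                self.sent.append((task, "TASK_LOST"))
            self.registered[slave] = set(tasks)
        elif slave not in self.pending:
            # First request: start re-admission with the reported tasks.
            self.pending[slave] = set(tasks)

    def finish_readmission(self, slave):
        # Re-add the slave with the tasks from the ORIGINAL request.
        self.registered[slave] = self.pending.pop(slave)

    def status_update(self, slave, task, state):
        if slave not in self.registered:
            return False  # dropped: re-admission still in progress
        self.registered[slave].discard(task)
        self.sent.append((task, state))
        return True


def replay():
    master = ToyMaster()
    master.reregister("S", {"T"})        # slave re-registers with running task T
    # T finishes while re-admission is in flight; the update is dropped.
    dropped = not master.status_update("S", "T", "TASK_FINISHED")
    master.finish_readmission("S")       # master re-adds S with task T
    master.reregister("S", set())        # enqueued retry, sent without T
    master.status_update("S", "T", "TASK_FINISHED")  # slave retries the update
    return dropped, master.sent
```

In this model, `replay()` reports that the first update was dropped and that the framework sees TASK_LOST followed by TASK_FINISHED for T, which is the inconsistency described above.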


The crux of the issue is that the master doesn't know about tasks in terminal states that
belong to a re-registering slave. The right way to fix this issue is to have the slave re-register
with tasks that have pending terminal updates, and possibly to have ACKs go through the master.
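
One way to picture the first half of that proposal is a reconciliation step that consults the slave's pending terminal updates before falling back to TASK_LOST. The sketch below is an assumption about how such a step might look, not the actual Mesos change; `reconcile` and its parameters are hypothetical, and routing ACKs through the master is not modeled.

```python
def reconcile(known, running, pending_terminal):
    """Return (task, state) updates for tasks the master knows about
    but the re-registering slave no longer reports as running.

    known:            task ids the master believes are running on the slave
    running:          task ids the slave reports as running
    pending_terminal: task id -> terminal state whose update is still unacknowledged
    (All three are hypothetical inputs for this sketch.)
    """
    updates = []
    for task in sorted(set(known) - set(running)):
        # Prefer the terminal state the slave reported; use TASK_LOST
        # only when the slave has no record of the task at all.
        updates.append((task, pending_terminal.get(task, "TASK_LOST")))
    return updates
```

With the slave reporting T as pending with TASK_FINISHED, `reconcile({"T"}, set(), {"T": "TASK_FINISHED"})` yields a TASK_FINISHED update rather than TASK_LOST, so the later retried update from the slave would agree with what the framework has already seen.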




--
This message was sent by Atlassian JIRA
(v6.2#6252)
