mesos-issues mailing list archives

From "Megha Sharma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-7215) Race condition on re-registration of non-partition-aware frameworks
Date Mon, 07 Aug 2017 20:42:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16117229#comment-16117229 ]

Megha Sharma commented on MESOS-7215:
-------------------------------------

We have been discussing a few approaches to address this. The one that made the most sense
was to remove entirely the behavior where the master kills non-partition-aware tasks when
an unreachable agent re-registers. Here is a quick summary of our implementation.

As far as Mesos internals are concerned, tasks from non-partition-aware frameworks are to be
treated the same way as partition-aware tasks. To do this cleanly and keep the internal
bookkeeping simple, such tasks will transition to TASK_UNREACHABLE instead of TASK_LOST when
the agent becomes unreachable. Internally, these tasks will be kept in the
Framework#unreachableTasks cache instead of Framework#completedTasks, and their state will be
TASK_UNREACHABLE. To remain backwards compatible, we will apply some transformations whenever
the data is exposed to users or a status update is sent to the framework, so that there is
no difference in how these tasks are presented before and after this patch.
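
To illustrate the backwards-compatibility transformation described above, here is a minimal
sketch. It is not the code from the review request. The types and the externalState() helper
are hypothetical stand-ins, and mapping TASK_UNREACHABLE back to TASK_LOST for
non-partition-aware frameworks is an assumption inferred from the summary, not a quote from
the patch.

```cpp
// Simplified stand-ins for the Mesos types; all names are illustrative only.
enum class TaskState { TASK_RUNNING, TASK_UNREACHABLE, TASK_LOST };

struct Framework {
  // Models whether the framework declared the PARTITION_AWARE capability.
  bool partitionAware;
};

// Hypothetical helper: returns the state to show to a framework or API user.
// Internally every task on an unreachable agent is TASK_UNREACHABLE; a
// non-partition-aware framework is assumed to still see TASK_LOST, as before.
TaskState externalState(const Framework& framework, TaskState internal) {
  if (internal == TaskState::TASK_UNREACHABLE && !framework.partitionAware) {
    return TaskState::TASK_LOST;
  }
  return internal;
}
```

With this shape, the master's bookkeeping only ever deals with one internal state, and the
difference between the two kinds of frameworks is confined to the point where data leaves
the master.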

Review Request: https://reviews.apache.org/r/61473/

> Race condition on re-registration of non-partition-aware frameworks
> -------------------------------------------------------------------
>
>                 Key: MESOS-7215
>                 URL: https://issues.apache.org/jira/browse/MESOS-7215
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.2.0
>            Reporter: Yan Xu
>            Assignee: Megha Sharma
>            Priority: Critical
>
> Prior to the partition-awareness work (MESOS-5344), when an agent reregistered after it
> had been removed, the master only sent {{ShutdownFrameworkMessages}} to the agent for
> frameworks that it knew had been torn down.
> With the new logic in MESOS-5344, Mesos now sends {{ShutdownFrameworkMessages}} to the
> agent for all non-partition-aware frameworks (including the ones that are still registered).
> This is problematic. An offer from this agent can still go to the same framework, which
> can then launch new tasks. The agent then receives tasks from that framework and ignores
> them because it thinks the framework is shutting down. The framework is not shutting down,
> of course, so from the master's and the scheduler's perspective the task is pending in
> STAGING forever, until the next agent reregistration, which could happen much later.
> This also makes the semantics of {{ShutdownFrameworkMessage}} ambiguous: the agent assumes
> the framework is going away (and acts accordingly) when it is not.
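
The sequence in the issue description can be reproduced with a toy model. This is only a
sketch under stated assumptions: ToyAgent and its methods are hypothetical and do not
correspond to the real agent implementation. They model just the one behavior the
description relies on, namely that an agent drops launches for a framework it believes is
shutting down.

```cpp
#include <set>
#include <string>

// Toy model of the agent-side behavior described in the issue (hypothetical).
class ToyAgent {
 public:
  // Models receipt of a ShutdownFrameworkMessage for the given framework.
  void shutdownFramework(const std::string& frameworkId) {
    shuttingDown_.insert(frameworkId);
  }

  // Models receipt of a task launch. Returns false when the task is silently
  // dropped because the agent believes the framework is shutting down.
  bool launchTask(const std::string& frameworkId) const {
    return shuttingDown_.count(frameworkId) == 0;
  }

 private:
  std::set<std::string> shuttingDown_;
};
```

If the master sends the shutdown for a framework that is still registered and later forwards
a launch from that framework, launchTask() returns false. Nothing in this model reports the
drop back, which mirrors how the task stays in STAGING from the master's point of view.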



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
