mesos-issues mailing list archives

From "Yan Xu (JIRA)" <>
Subject [jira] [Updated] (MESOS-6483) Check failure when a 1.1 master marking a 0.28 agent as unreachable
Date Fri, 28 Oct 2016 18:09:00 GMT


Yan Xu updated MESOS-6483:
    Fix Version/s: 1.2.0

> Check failure when a 1.1 master marking a 0.28 agent as unreachable
> -------------------------------------------------------------------
>                 Key: MESOS-6483
>                 URL:
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Megha
>            Assignee: Neil Conway
>             Fix For: 1.1.0, 1.2.0
> When upgrading directly from Mesos 0.28 to a version > 1.0, there is a scenario that can
cause the CHECK(frameworks.recovered.contains(frameworkId)) in Master::_markUnreachable(..)
to fail. The following sequence of events can occur:
> 1) The master is upgraded to the new version first, while an agent, say X, is still
running Mesos 0.28.
> 2) Agent X (on 0.28) attempts to re-register with the master (on, say, 1.1). Because
frameworkInfos were not available in the older Mesos version, the agent does not send the
frameworks (frameworkInfos) in the ReRegisterSlave message.
> 3) Among the frameworks on agent X is a framework Y that did not re-register after the
master's failover. Since the master builds frameworks.recovered from the frameworkInfos that
agents provide, framework Y is in neither the recovered nor the registered frameworks.
> 4) After re-registering, agent X fails the master's health check and is marked unreachable
by the master. The CHECK(frameworks.recovered.contains(frameworkId)) then fires for framework Y,
because Y is in neither recovered nor registered but has tasks running on agent X.
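The sequence above can be reproduced with a toy model of the master's bookkeeping. This is a minimal sketch under stated assumptions, not the Mesos implementation: `Master`, `ReregisterMessage`, `reregister_agent`, and `mark_unreachable` are hypothetical names, and the model assumes only what the report describes: `recovered` is populated solely from agent-supplied frameworkInfos, and the CHECK fires for a framework that has tasks on the agent but is neither registered nor recovered.

```python
from dataclasses import dataclass, field


@dataclass
class ReregisterMessage:
    """Hypothetical stand-in for an agent's re-registration message."""
    agent_version: str
    framework_infos: list[str]     # framework IDs the agent reports; empty for a 0.28 agent
    task_framework_ids: list[str]  # framework ID of each task running on the agent


@dataclass
class Master:
    """Toy model (assumption) of the master's framework bookkeeping after failover."""
    registered: set[str] = field(default_factory=set)  # frameworks that re-registered
    recovered: set[str] = field(default_factory=set)   # learned only from agents

    def reregister_agent(self, msg: ReregisterMessage) -> None:
        # The master learns about frameworks that have not re-registered
        # only from the frameworkInfos the agent sends (step 3).
        for framework_id in msg.framework_infos:
            if framework_id not in self.registered:
                self.recovered.add(framework_id)

    def mark_unreachable(self, msg: ReregisterMessage) -> list[str]:
        # Return the framework IDs for which the modelled
        # CHECK(frameworks.recovered.contains(frameworkId)) would fail:
        # frameworks with tasks on the agent that are neither registered
        # nor recovered (step 4). dict.fromkeys de-duplicates in order.
        return [
            framework_id
            for framework_id in dict.fromkeys(msg.task_framework_ids)
            if framework_id not in self.registered
            and framework_id not in self.recovered
        ]


if __name__ == "__main__":
    # Framework Z re-registered after failover; framework Y did not.
    master = Master(registered={"framework-Z"})
    agent_x = ReregisterMessage("0.28", [], ["framework-Y", "framework-Z"])
    master.reregister_agent(agent_x)
    print(master.mark_unreachable(agent_x))  # ['framework-Y']
```

In this model, an agent that does report its frameworkInfos (for example `ReregisterMessage("1.1", ["framework-Y", "framework-Z"], ["framework-Y", "framework-Z"])`) puts Y into `recovered`, so `mark_unreachable` returns an empty list. That contrast isolates the missing frameworkInfos from the 0.28 agent as the trigger.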

This message was sent by Atlassian JIRA
