mesos-issues mailing list archives

From "Yan Xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-6483) Check failure when a 1.1 master marking a 0.28 agent as unreachable
Date Wed, 26 Oct 2016 17:52:58 GMT

    [ https://issues.apache.org/jira/browse/MESOS-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15609167#comment-15609167 ]

Yan Xu commented on MESOS-6483:
-------------------------------

[~neilc] How should we handle an unreachable 0.28 agent with a 1.1 master? This can happen when
you upgrade directly from 0.28 to 1.1, which in theory we should (try our best to) support.
Upgrading agents first (so that 1.1 agents run against a 0.28 master) sounds workable, but if
so, should we add a section to upgrades.md?

> Check failure when a 1.1 master marking a 0.28 agent as unreachable
> -------------------------------------------------------------------
>
>                 Key: MESOS-6483
>                 URL: https://issues.apache.org/jira/browse/MESOS-6483
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Megha
>
> When upgrading directly from Mesos 0.28 to a version > 1.0, the following sequence of events
> can cause the CHECK(frameworks.recovered.contains(frameworkId)) in Master::_markUnreachable(..)
> to fail:
> 1) The master is upgraded first to the new version while an agent, say X, is still
> at Mesos 0.28.
> 2) Agent X (at Mesos 0.28) attempts to re-register with the master (at, say, 1.1).
> It doesn't send the frameworks (frameworkInfos) in the ReRegisterSlave message,
> since that field wasn't available in the older Mesos version.
> 3) Among the frameworks on agent X is a framework Y that didn't re-register
> after the master's failover. Since the master builds frameworks.recovered from the
> frameworkInfos that agents provide, framework Y is in neither the recovered nor the
> registered frameworks.
> 4) After re-registering, agent X fails the master's health check and is marked
> unreachable by the master. The CHECK(frameworks.recovered.contains(frameworkId)) then
> fails for framework Y, since it is in neither recovered nor registered but has tasks
> running on agent X.
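
For illustration only, the four steps above can be modelled with a small standalone C++ sketch. This is not the Mesos master code: MasterModel, reregisterAgent and unknownFrameworksOnUnreachable are hypothetical names, framework and agent IDs are plain strings, and the CHECK is modelled as a function that returns the offending framework IDs instead of aborting.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the master's framework bookkeeping.
struct MasterModel {
  // Frameworks that re-registered with the master after its failover.
  std::set<std::string> registered;
  // Frameworks the master learned about only from agents' frameworkInfos.
  std::set<std::string> recovered;
  // Agent ID -> framework IDs of the tasks running on that agent.
  std::map<std::string, std::vector<std::string>> tasksByAgent;

  // Models agent re-registration. An agent at 0.28 reports its tasks but no
  // frameworkInfos, so callers pass an empty vector for the last argument.
  void reregisterAgent(const std::string& agentId,
                       const std::vector<std::string>& taskFrameworkIds,
                       const std::vector<std::string>& frameworkInfos) {
    tasksByAgent[agentId] = taskFrameworkIds;
    for (const std::string& id : frameworkInfos) {
      if (registered.count(id) == 0) {
        recovered.insert(id);
      }
    }
  }

  // Models marking an agent unreachable. Returns the framework IDs that have
  // tasks on the agent but are in neither `registered` nor `recovered`, i.e.
  // the cases in which the CHECK described above would fire.
  std::vector<std::string> unknownFrameworksOnUnreachable(
      const std::string& agentId) const {
    std::vector<std::string> unknown;
    auto it = tasksByAgent.find(agentId);
    if (it == tasksByAgent.end()) {
      return unknown;
    }
    for (const std::string& id : it->second) {
      if (registered.count(id) == 0 && recovered.count(id) == 0) {
        unknown.push_back(id);
      }
    }
    return unknown;
  }
};
```

Under these assumptions, re-registering agent X with tasks for framework Y and an empty frameworkInfos list leaves Y untracked, so unknownFrameworksOnUnreachable("X") reports Y. Passing Y in frameworkInfos, as a newer agent would, yields an empty result.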



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
