mesos-issues mailing list archives

From "Neil Conway (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MESOS-7389) Check failed: frameworks_.contains(task.framework_id())
Date Thu, 13 Apr 2017 18:46:41 GMT

    [ https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968051#comment-15968051
] 

Neil Conway edited comment on MESOS-7389 at 4/13/17 6:46 PM:
-------------------------------------------------------------

Interesting. The basic logic here:

* The agent is re-registering with the master.
* The agent reports the list of tasks it is running, and the list of frameworks that are running tasks on it.
* The assertion fires because there is a task running on the agent with a framework ID that is not in the list of frameworks the agent reported.

Pre-1.0 Mesos agents _only_ report the tasks they are running, not the list of frameworks. Connecting pre-1.0 Mesos agents to a 1.2.0 Mesos master is not _technically_ supported, but we don't actually guard against it just yet (MESOS-6975). So if the agent was actually running some pre-1.0 version of Mesos, that would explain the problem. Fixing the crash with pre-1.0 agents is probably worth doing regardless.

If the agent was in fact running Mesos 1.0.1, something else is going on here.

[~nicholasstudt] -- can you confirm that the agent in question was definitely running Mesos 1.0.1 when the problem was observed?



> Check failed: frameworks_.contains(task.framework_id())
> -------------------------------------------------------
>
>                 Key: MESOS-7389
>                 URL: https://issues.apache.org/jira/browse/MESOS-7389
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.2.0
>         Environment: Ubuntu 14.04 
>            Reporter: Nicholas Studt
>
> During an upgrade from 1.0.1 to 1.2.0, a single mesos-slave re-registering with the running leader caused the leader to terminate. All 3 masters suffered the same failure as the same slave node re-registered against each new leader; this continued across the entire cluster until the offending slave node was removed and fixed. The fix to the slave node was to remove the mesos directory and then start the slave node back up.
>  F0412 17:24:42.736600  6317 master.cpp:5701] Check failed: frameworks_.contains(task.framework_id())
>  *** Check failure stack trace: ***
>      @     0x7f59f944f94d  google::LogMessage::Fail()
>      @     0x7f59f945177d  google::LogMessage::SendToLog()
>      @     0x7f59f944f53c  google::LogMessage::Flush()
>      @     0x7f59f9452079  google::LogMessageFatal::~LogMessageFatal()
>  I0412 17:24:42.750300  6316 replica.cpp:693] Replica received learned notice for position
6896 from @0.0.0.0:0 
>      @     0x7f59f88f2341  mesos::internal::master::Master::_reregisterSlave()
>      @     0x7f59f88f488f  _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERKSt6vectorINS5_8ResourceESaISG_EERKSF_INS5_12ExecutorInfoESaISL_EERKSF_INS5_4TaskESaISQ_EERKSF_INS5_13FrameworkInfoESaISV_EERKSF_INS6_17Archive_FrameworkESaIS10_EERKSsRKSF_INS5_20SlaveInfo_CapabilityESaIS17_EERKNS0_6FutureIbEES9_SC_SI_SN_SS_SX_S12_SsS19_S1D_EEvRKNS0_3PIDIT_EEMS1H_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_T19_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
>      @     0x7f59f93c3eb1  process::ProcessManager::resume()
>      @     0x7f59f93ccd57  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
>      @     0x7f59f77cfa60  (unknown)
>      @     0x7f59f6fec184  start_thread
>      @     0x7f59f6d19bed  (unknown)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
