Date: Thu, 13 Apr 2017 18:46:41 +0000 (UTC)
From: "Neil Conway (JIRA)"
To: issues@mesos.apache.org
Reply-To: dev@mesos.apache.org
Subject: [jira] [Comment Edited] (MESOS-7389) Check failed: frameworks_.contains(task.framework_id())

    [ https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968051#comment-15968051 ]

Neil Conway edited comment on MESOS-7389 at 4/13/17 6:46 PM:
-------------------------------------------------------------

Interesting. Basic logic here:

* Agent is re-registering with the master
* The agent reports a list of the tasks it is running, and the frameworks that are running tasks on it
* The assertion fires because there is a task running on the agent with a framework ID that is not in the list of frameworks the agent reported.

Pre-1.0 Mesos agents _only_ report the tasks they are running, not the list of frameworks. Connecting pre-1.0 Mesos agents to 1.2.0 Mesos master is not _technically_ supported, but we don't actually guard against it just yet (MESOS-6975). So if the Mesos agent was actually running some pre-1.0 version of Mesos, that would explain the problem. Fixing the crash with pre-1.0 Mesos agents is probably worth doing regardless.

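To make the failure mode above concrete, here is a minimal, self-contained sketch of the consistency check being described. It is not the Mesos master code: the `Task` struct and the functions `tasksWithUnknownFramework`, `strictReregister` and `canReregister` are hypothetical names invented for illustration, and plain `assert` stands in for the fatal check seen in the log.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a task reported by an agent.
struct Task {
  std::string id;
  std::string framework_id;
};

// Returns the IDs of tasks whose framework ID is absent from the set of
// frameworks the agent reported during re-registration.
std::vector<std::string> tasksWithUnknownFramework(
    const std::vector<Task>& tasks,
    const std::set<std::string>& frameworks)
{
  std::vector<std::string> unknown;
  for (const Task& task : tasks) {
    if (frameworks.count(task.framework_id) == 0) {
      unknown.push_back(task.id);
    }
  }
  return unknown;
}

// Strict variant (assumption: analogous in spirit to the failing check).
// Aborts the process as soon as one task names an unreported framework.
void strictReregister(
    const std::vector<Task>& tasks,
    const std::set<std::string>& frameworks)
{
  for (const Task& task : tasks) {
    assert(frameworks.count(task.framework_id) > 0);
  }
}

// Tolerant variant: reports the inconsistency to the caller instead of
// aborting, so the caller could reject or special-case the agent.
bool canReregister(
    const std::vector<Task>& tasks,
    const std::set<std::string>& frameworks)
{
  return tasksWithUnknownFramework(tasks, frameworks).empty();
}
```

In this model, an agent that sends its tasks but an empty framework list (the pre-1.0 behaviour described above) makes `canReregister` return false for any non-empty task list, whereas `strictReregister` would abort the process.
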
If the agent was in fact running Mesos 1.0.1, something else is going on here.

[~nicholasstudt] -- can you confirm that the agent in question was definitely running Mesos 1.0.1 when the problem was observed?


was (Author: neilc):
Interesting. Basic logic here:

* Agent is re-registering with the master
* The agent reports a list of the tasks it is running, and the frameworks that are running tasks on it
* The assertion fires because there is a task running on the agent with a framework ID that is not in the list of frameworks the agent reported.

Pre-1.0 Mesos agents _only_ report the tasks they are running, not the list of frameworks. Connecting pre-1.0 Mesos agents to 1.2.0 Mesos master is not _technically_ supported, but we don't actually guard against it just yet. So if the Mesos agent was actually running some pre-1.0 version of Mesos, that would explain the problem.

If the agent was in fact running Mesos 1.0.1, something else is going on here.

[~nicholasstudt] -- can you confirm that the agent in question was definitely running Mesos 1.0.1 when the problem was observed?

> Check failed: frameworks_.contains(task.framework_id())
> -------------------------------------------------------
>
>                 Key: MESOS-7389
>                 URL: https://issues.apache.org/jira/browse/MESOS-7389
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.2.0
>         Environment: Ubuntu 14.04
>            Reporter: Nicholas Studt
>
> During upgrade from 1.0.1 to 1.2.0 a single mesos-slave reregistering with the running leader caused the leader to terminate. All 3 of the masters suffered the same failure as the same slave node reregistered against the new leader, this continued across the entire cluster until the offending slave node was removed and fixed. The fix to the slave node was to remove the mesos directory and then start the slave node back up.
> F0412 17:24:42.736600 6317 master.cpp:5701] Check failed: frameworks_.contains(task.framework_id())
> *** Check failure stack trace: ***
> @ 0x7f59f944f94d google::LogMessage::Fail()
> @ 0x7f59f945177d google::LogMessage::SendToLog()
> @ 0x7f59f944f53c google::LogMessage::Flush()
> @ 0x7f59f9452079 google::LogMessageFatal::~LogMessageFatal()
> I0412 17:24:42.750300 6316 replica.cpp:693] Replica received learned notice for position 6896 from @0.0.0.0:0
> @ 0x7f59f88f2341 mesos::internal::master::Master::_reregisterSlave()
> @ 0x7f59f88f488f _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERKSt6vectorINS5_8ResourceESaISG_EERKSF_INS5_12ExecutorInfoESaISL_EERKSF_INS5_4TaskESaISQ_EERKSF_INS5_13FrameworkInfoESaISV_EERKSF_INS6_17Archive_FrameworkESaIS10_EERKSsRKSF_INS5_20SlaveInfo_CapabilityESaIS17_EERKNS0_6FutureIbEES9_SC_SI_SN_SS_SX_S12_SsS19_S1D_EEvRKNS0_3PIDIT_EEMS1H_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_T19_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f59f93c3eb1 process::ProcessManager::resume()
> @ 0x7f59f93ccd57 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f59f77cfa60 (unknown)
> @ 0x7f59f6fec184 start_thread
> @ 0x7f59f6d19bed (unknown)

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)