Date: Tue, 30 Jul 2013 17:19:50 +0000 (UTC)
From: "Vinod Kone (JIRA)"
To: dev@mesos.apache.org
Reply-To: dev@mesos.apache.org
Subject: [jira] [Resolved] (MESOS-530) A registered slave should check registration id when it receives multiple (re-)registered messages from the master

     [ https://issues.apache.org/jira/browse/MESOS-530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone resolved MESOS-530.
------------------------------

    Resolution: Fixed

> A registered slave should check registration id when it receives multiple (re-)registered messages from the master
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-530
>                 URL: https://issues.apache.org/jira/browse/MESOS-530
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.13.0
>            Reporter: Vinod Kone
>            Assignee: Vinod Kone
>             Fix For: 0.13.0
>
>
> We have seen this in production at Twitter.
> Timeline of events:
> 06/26 06:46: Slave host rebooted.
> 06/26 06:46.54: Slave registered (201305082239-1864771594-5050-8729-594) with the master, but the slave never got the ACK. Presumably there were network partition issues.
> 06/26 06:47.21: Slave disconnected from the master and the master removed the slave.
> 06/26 06:47.21: Immediately after, the master registered the slave with a new id (201305082239-1864771594-5050-8729-609).
> But the slave only received the old id (201305082239-1864771594-5050-8729-594)! So there is a mismatch between the slave id known to the master and the slave id known to the slave.
> Currently the slave silently ignores a (re-)registered message from the master if it is already (re-)registered. This was originally designed to ignore duplicate (re-)registered messages sent by the master, but it clearly doesn't catch the above edge case.
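
The check this issue asks for can be sketched as follows. This is only an illustrative Python sketch, not the actual Mesos C++ code or the committed fix: the `Slave` class, the `on_registered` method, and the `SlaveIdMismatch` exception are hypothetical names, and raising an error on a mismatch is an assumed policy (the issue only says the slave should check the id). The idea is that an already registered slave compares the id in each later (re-)registered message with the id it holds, and ignores the message only when the two match.

```python
from dataclasses import dataclass
from typing import Optional


class SlaveIdMismatch(Exception):
    """Hypothetical error: the master acknowledged an id the slave does not hold."""


@dataclass
class Slave:
    # Hypothetical stand-in for the slave's registration state.
    slave_id: Optional[str] = None

    def on_registered(self, acked_id: str) -> str:
        """Handle a (re-)registered message carrying `acked_id`."""
        if self.slave_id is None:
            # First acknowledgement: adopt the id assigned by the master.
            self.slave_id = acked_id
            return "registered"
        if acked_id == self.slave_id:
            # Genuine duplicate: safe to ignore, as originally intended.
            return "ignored duplicate"
        # Already registered under a different id: master and slave disagree.
        # Surfacing the mismatch (assumed policy) instead of ignoring it.
        raise SlaveIdMismatch(
            f"registered as {self.slave_id} but master acknowledged {acked_id}"
        )


# Usage with the two ids quoted in the timeline.
slave = Slave()
slave.on_registered("201305082239-1864771594-5050-8729-594")
slave.on_registered("201305082239-1864771594-5050-8729-594")  # duplicate, ignored
try:
    slave.on_registered("201305082239-1864771594-5050-8729-609")
except SlaveIdMismatch as error:
    print(f"mismatch detected: {error}")
```

Comparing ids keeps the original behaviour for true duplicates while making the mismatched case visible, so whatever recovery the real system prefers (for example, restarting the slave so it registers afresh) has a place to hook in.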