Date: Tue, 18 Nov 2014 00:26:33 +0000 (UTC)
From: "Zach Carlson (JIRA)"
To: issues@mesos.apache.org
Reply-To: dev@mesos.apache.org
Subject: [jira] [Updated] (MESOS-2122) MesosSchedulerDriver stop causes resource offer exhaustion

     [ https://issues.apache.org/jira/browse/MESOS-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zach Carlson updated MESOS-2122:
--------------------------------
    Affects Version/s:     (was: 0.21.0)

> MesosSchedulerDriver stop causes resource offer exhaustion
> ----------------------------------------------------------
>
>                 Key: MESOS-2122
>                 URL: https://issues.apache.org/jira/browse/MESOS-2122
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.20.0, 0.20.1
>         Environment: x86_64 Debian Wheezy (w/ mesosphere repos, packages)
>            Reporter: Zach Carlson
>         Attachments: mesos_2122.py
>
>
> For additional consideration, see
> https://github.com/airbnb/chronos/issues/290 and https://github.com/mesosphere/marathon/issues/787
>
> When the SchedulerProcess managed by the MesosSchedulerDriver detects a master, it performs a link() to the master, and libprocess proceeds to establish the link. Once the scheduler has performed all the work necessary, it may call MesosSchedulerDriver.stop(failover = true).
>
> This is where things go awry. At this point, the SchedulerProcess schedules a termination event for itself. When libprocess's schedule thread rolls through, it performs a cleanup() of the SchedulerProcess, as expected. Part of the cleanup() is calling SocketManager::exited() on the SchedulerProcess. The problem is that SocketManager::exited() removes the links from the link map but does not actually close the sockets.
>
> Since MesosSchedulerDriver::stop() was called with failover = true, no DeregisterFramework message was sent, so the Mesos master believes that the connection (which is still open) belongs to a registered framework that is listening for events. It sends resourceOffers to the 'valid' framework, and since nothing is actually listening, no response is sent and no offers are accepted or declined. Mesos then grinds to a halt: no further offers are made to any framework, and once all currently running framework work has completed, no further work is performed because the offers are wasted. (This holds until version 0.21.0, which, according to the release notes, will rescind unanswered offers after a configurable timeout.)
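
To make the failure mode above concrete, here is a minimal toy model in Python. It is not libprocess or Mesos code: ToySocketManager, peer_looks_connected and run_scenario are hypothetical names invented for this sketch, and the close_sockets option is only an assumed remedy for illustration, not the actual fix. It shows that dropping link bookkeeping while leaving the socket open keeps the connection looking alive from the other end, which is how the master would see it.

```python
import socket


class ToySocketManager:
    """Toy stand-in for libprocess's SocketManager (hypothetical, not the real API)."""

    def __init__(self):
        self.links = {}    # linker pid -> set of linked peer names
        self.sockets = {}  # peer name -> socket kept open on behalf of links

    def link(self, pid, peer, sock):
        self.links.setdefault(pid, set()).add(peer)
        self.sockets[peer] = sock

    def exited(self, pid, close_sockets=False):
        # Reported behaviour (close_sockets=False): drop the link bookkeeping only.
        peers = self.links.pop(pid, set())
        if close_sockets:
            # Assumed remedy, for illustration: close sockets nobody links to anymore.
            for peer in peers:
                if not any(peer in others for others in self.links.values()):
                    self.sockets.pop(peer).close()


def peer_looks_connected(sock):
    """True if the other end has not closed: a non-blocking read sees no EOF."""
    sock.setblocking(False)
    try:
        return sock.recv(1) != b""
    except BlockingIOError:
        return True  # nothing to read and no EOF, so the connection is still up
    except OSError:
        return False


def run_scenario(close_sockets):
    """Return whether the 'master' end still sees a live connection after exited()."""
    master_end, scheduler_end = socket.socketpair()
    try:
        manager = ToySocketManager()
        manager.link("scheduler(1)", "master", scheduler_end)
        manager.exited("scheduler(1)", close_sockets=close_sockets)
        return peer_looks_connected(master_end)
    finally:
        master_end.close()
        scheduler_end.close()  # closing an already closed socket is harmless


if __name__ == "__main__":
    print("links dropped, socket left open ->", run_scenario(close_sockets=False))
    print("links dropped, socket closed    ->", run_scenario(close_sockets=True))
```

In the first scenario the "master" end has no way to tell that the scheduler is gone, so it would keep sending offers. In the second, the close is visible to the peer as end-of-file, which gives the master a signal to stop treating the framework as connected.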