mesos-issues mailing list archives

From "Zach Carlson (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-2122) MesosSchedulerDriver stop causes resource offer exhaustion
Date Tue, 18 Nov 2014 00:26:33 GMT

     [ https://issues.apache.org/jira/browse/MESOS-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zach Carlson updated MESOS-2122:
--------------------------------
    Affects Version/s:     (was: 0.21.0)

> MesosSchedulerDriver stop causes resource offer exhaustion
> ----------------------------------------------------------
>
>                 Key: MESOS-2122
>                 URL: https://issues.apache.org/jira/browse/MESOS-2122
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.20.0, 0.20.1
>         Environment: x86_64 Debian Wheezy (w/ mesosphere repos, packages)
>            Reporter: Zach Carlson
>         Attachments: mesos_2122.py
>
>
> For related reports, see https://github.com/airbnb/chronos/issues/290 and https://github.com/mesosphere/marathon/issues/787
>
> When the SchedulerProcess managed by the MesosSchedulerDriver detects a master, it performs
> a link() to the master, and libprocess proceeds to establish the link. Once the scheduler has
> done all the work it needs to do, it may call MesosSchedulerDriver.stop(failover = true).

> This is where things go awry. At this point, the SchedulerProcess schedules a termination
> event for itself. When libprocess's schedule thread processes that event, it performs a
> cleanup() of the SchedulerProcess, as expected. Part of cleanup() is calling
> SocketManager::exited() on the SchedulerProcess. The problem is that SocketManager::exited()
> removes the links from the link map but does not actually close the sockets.
>
> Because MesosSchedulerDriver::stop() was called with failover = true, no DeregisterFramework
> message was sent, so the Mesos master believes that the connection (which is still open)
> belongs to a registered framework that is listening for events. It sends resourceOffers to
> the 'valid' framework, and since nothing is actually listening, no response is sent and no
> offers are accepted or declined. Mesos then grinds to a halt: no further offers are made to
> any framework, and once all current framework work has completed, no further work is
> performed because the offers are wasted. (According to the release notes, version 0.21.0
> will rescind unanswered offers after a configurable timeout.)
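
The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToySocket, ToySocketManager, ToyMaster, and simulate() are hypothetical names invented for illustration, not libprocess or Mesos APIs, and the offer loop is a deliberate simplification rather than the real Mesos allocator. It assumes the master hands all free resources to the first framework whose connection still looks open and never rescinds them (the pre-0.21.0 behavior described in the report).

```python
from dataclasses import dataclass, field


@dataclass
class ToySocket:
    """Hypothetical stand-in for a TCP connection; tracks only open/closed."""
    is_open: bool = True


@dataclass
class ToySocketManager:
    """Hypothetical stand-in for libprocess's SocketManager (not the real API)."""
    links: dict = field(default_factory=dict)    # process id -> linked peer
    sockets: dict = field(default_factory=dict)  # peer -> ToySocket

    def link(self, pid, peer):
        self.links[pid] = peer
        self.sockets.setdefault(peer, ToySocket())

    def exited(self, pid, close_socket=False):
        # Reported behavior: the link map entry is removed, but the socket
        # stays open unless close_socket is set (the assumed fix).
        peer = self.links.pop(pid, None)
        if close_socket and peer is not None:
            self.sockets[peer].is_open = False


class ToyMaster:
    """Simplified master: offers every free CPU to the first open connection."""

    def __init__(self, cpus):
        self.free_cpus = cpus
        self.frameworks = {}   # framework name -> ToySocket
        self.outstanding = {}  # framework name -> CPUs tied up in offers

    def register(self, name, sock):
        self.frameworks[name] = sock

    def offer_round(self):
        for name, sock in self.frameworks.items():
            if sock.is_open and self.free_cpus > 0:
                self.outstanding[name] = self.outstanding.get(name, 0) + self.free_cpus
                self.free_cpus = 0  # never rescinded in this model


def simulate(close_socket_on_exit):
    manager = ToySocketManager()
    manager.link("scheduler", "master")

    master = ToyMaster(cpus=4)
    master.register("stopped-framework", manager.sockets["master"])
    master.register("live-framework", ToySocket())

    # Models stop(failover = true): the process is cleaned up and no
    # deregistration message reaches the master.
    manager.exited("scheduler", close_socket=close_socket_on_exit)

    master.offer_round()
    return master.outstanding


if __name__ == "__main__":
    print(simulate(close_socket_on_exit=False))
    print(simulate(close_socket_on_exit=True))
```

In this model, leaving the socket open ties up all four CPUs in offers to the stopped framework, so the live framework receives nothing. When the socket is closed on exit, the master skips the stopped framework and the live one is offered the resources.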



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
