mesos-issues mailing list archives

From "Yan Xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-7181) Stale frameworks seen on Mesos, but not known to scheduler
Date Wed, 22 Mar 2017 01:59:41 GMT

    [ https://issues.apache.org/jira/browse/MESOS-7181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15935651#comment-15935651
] 

Yan Xu commented on MESOS-7181:
-------------------------------

Could you elaborate on how you see it being implemented?

Currently the semantics of {{ExitedEvent}} are tied to "persistent connections", so the sender
knows whether an {{ExitedEvent}} should be generated but the receiver does not. Would you like
to add another type of event?

Currently libprocess sends a {{NotFound}} for [non-libprocess clients|https://github.com/apache/mesos/blob/05e9a1d40572b8383a582e15663d861b134a7dad/3rdparty/libprocess/src/process.cpp#L2849].
So I imagine we would need to generalize it to libprocess clients as well (there's a note about
compatibility issues). Right now "libprocess will ignore responses", so we would need to change
that too. Is this what you had in mind?

> Stale frameworks seen on Mesos, but not known to scheduler
> ----------------------------------------------------------
>
>                 Key: MESOS-7181
>                 URL: https://issues.apache.org/jira/browse/MESOS-7181
>             Project: Mesos
>          Issue Type: Bug
>          Components: general
>            Reporter: Anindya Sinha
>            Assignee: Anindya Sinha
>
> Using a scheduler which launches multiple frameworks using the scheduler driver, we occasionally
> observe that a framework exists on Mesos which is not known to the scheduler. Since there
> is no entity that acts on the offers, this framework ends up hogging all the offers, leading
> to starvation in the cluster.
> This particular scenario is as follows:
> 1) Scheduler does a driver.start() which results in the 1st SUBSCRIBE sent to master.
> 2) The scheduler driver resends the SUBSCRIBE (since the framework has not yet registered)
> as a result of the exponential backoff.
> 3) Framework is registered based on the 1st SUBSCRIBE, but the scheduler issues a driver.stop()
> immediately, which results in a TEARDOWN sent to the master.
> 4) Master processes the TEARDOWN, which removes the framework.
> 5) Master now processes the 2nd SUBSCRIBE (after authorization) and tries to add this
> framework. This succeeds and a new framework id is generated (since the original framework
> is no longer registered after the TEARDOWN), but the scheduler driver has by now already
> terminated, since the scheduler issued the driver.stop(). So the master continues to send
> offers to this 2nd framework, which holds on to them until the offer timeout.
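
The ordering in steps 1) to 5) can be replayed with a small toy model. This is only a sketch under assumptions: {{ToyMaster}}, {{replay_race}}, the scheduler name and the framework id format are hypothetical names made up for illustration, and none of this is the Mesos or libprocess API. It only shows why a SUBSCRIBE retry that is processed after the TEARDOWN leaves behind a framework that no scheduler knows about.

```python
import itertools


class ToyMaster:
    """Hypothetical stand-in for the master; not the Mesos API."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.frameworks = {}  # framework id -> name of the subscribing scheduler

    def subscribe(self, scheduler):
        # Assumption: a retry from a scheduler that is still registered
        # is answered with the existing framework id.
        for fid, owner in self.frameworks.items():
            if owner == scheduler:
                return fid
        # Otherwise the SUBSCRIBE is treated as a brand new framework.
        fid = f"framework-{next(self._ids)}"
        self.frameworks[fid] = scheduler
        return fid

    def teardown(self, fid):
        self.frameworks.pop(fid, None)


def replay_race():
    master = ToyMaster()
    known_to_scheduler = set()

    # 1) + 2): the 1st SUBSCRIBE and its backoff retry are both in flight.
    # 3): the framework is registered based on the 1st SUBSCRIBE.
    first = master.subscribe("scheduler-A")
    known_to_scheduler.add(first)

    # 3) + 4): driver.stop() sends TEARDOWN and the master removes the framework.
    master.teardown(first)
    known_to_scheduler.discard(first)

    # 5): the delayed 2nd SUBSCRIBE is processed only now, after the TEARDOWN,
    # so the master no longer recognizes it as a retry.
    second = master.subscribe("scheduler-A")

    # Frameworks the master tracks that the scheduler has no record of.
    stale = set(master.frameworks) - known_to_scheduler
    return first, second, stale
```

Calling {{replay_race()}} in this model returns two different framework ids, and the second one ends up in the stale set: the master still tracks it, while the scheduler that sent the SUBSCRIBE has already stopped its driver.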



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
