mesos-issues mailing list archives

From "Joseph Wu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-5180) Scheduler driver does not detect disconnection with master and reregister.
Date Tue, 12 Apr 2016 00:18:25 GMT

     [ https://issues.apache.org/jira/browse/MESOS-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Wu updated MESOS-5180:
-----------------------------
    Description: 
The existing implementation of the scheduler driver does not re-register with the master under
some network partition cases.

When a scheduler registers with the master:
1) master links to the framework
2) framework links to the master

It is possible for either of these links to break *without* the master changing.  (Currently,
the scheduler driver re-registers only if the master changes.)

If both links break or if just link (1) breaks, the master views the framework as {{inactive}}
and {{disconnected}}.  This means the framework will not receive any more events (such as
offers) from the master until it re-registers.  There is currently no way for the scheduler
to detect a one-way link breakage.

If link (2) breaks, it makes (almost) no difference to the scheduler.  The scheduler usually
uses the link to send messages to the master, but libprocess will create another socket if
the persistent one is not available.
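
The cases above can be summarized in a short C++ sketch.  This only illustrates the case
analysis and is not Mesos code: {{Links}}, {{Outcome}}, {{classify}} and {{describe}} are
hypothetical names introduced for this example.

```cpp
#include <cassert>

// Hypothetical model of the two links set up at registration time.
struct Links {
  bool masterToFramework;  // link (1): master links to the framework
  bool frameworkToMaster;  // link (2): framework links to the master
};

enum class Outcome {
  Healthy,                // both links are up
  SchedulerUnaffected,    // only link (2) is broken
  FrameworkDisconnected   // link (1) is broken, alone or together with (2)
};

// Maps the state of the two links to the effect described in the issue.
Outcome classify(Links links) {
  if (!links.masterToFramework) {
    // The master marks the framework inactive and disconnected,
    // so no further events (such as offers) are delivered.
    return Outcome::FrameworkDisconnected;
  }
  if (!links.frameworkToMaster) {
    // libprocess opens another socket for outgoing messages,
    // so the scheduler sees (almost) no difference.
    return Outcome::SchedulerUnaffected;
  }
  return Outcome::Healthy;
}

// Human-readable summary of each outcome.
const char* describe(Outcome outcome) {
  switch (outcome) {
    case Outcome::Healthy:
      return "both links up";
    case Outcome::SchedulerUnaffected:
      return "link (2) down; scheduler can still send via a new socket";
    case Outcome::FrameworkDisconnected:
      return "link (1) down; framework gets no events until it re-registers";
  }
  assert(false && "unhandled Outcome");
  return "";
}
```

For example, {{classify({false, true})}} models a break of link (1) only, the one-way case
the scheduler cannot currently detect.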

To fix link breakages for (1+2) and (2), the scheduler driver should implement a `::exited`
event handler for the master's {{pid}} and re-register in this case.
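
A minimal sketch of the proposed handler, assuming a simplified stand-in for the driver.  In
libprocess, a process that has linked to a pid is notified through its {{exited}} callback
when that link breaks.  Everything else here ({{SketchDriver}}, the string pids, the {{sent}}
log) is hypothetical and only models the control flow, not the actual driver code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the scheduler driver; not the real Mesos class.
struct SketchDriver {
  std::string master;             // pid of the master we registered with
  bool connected = false;         // the driver's own view of the connection
  std::vector<std::string> sent;  // messages "sent" to the master, kept for inspection

  // A master was detected (initial detection or failover): register with it.
  void detected(const std::string& pid) {
    master = pid;
    connected = false;
    sent.push_back("REGISTER -> " + pid);
  }

  // The master acknowledged the (re-)registration.
  void registered() {
    assert(!master.empty());  // an acknowledgement only makes sense after detection
    connected = true;
  }

  // Models the `exited` callback: the link to `pid` broke.
  void exited(const std::string& pid) {
    if (pid != master) {
      return;  // not the master we are registered with; nothing to do
    }
    // The master itself did not change, so detection will not fire.
    // Treat the broken link as a disconnection and re-register.
    connected = false;
    sent.push_back("REREGISTER -> " + master);
  }
};
```

Calling {{detected}}, then {{registered}}, then {{exited}} with the same pid leaves the driver
disconnected with a pending re-registration, which is the behavior this issue asks for.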

See the related issue MESOS-5181 for link (1) breakage.

  was:
The existing implementation of the scheduler driver does not re-register with the master under
some network partition cases.

When a scheduler registers with the master:
1) master links to the framework
2) framework links to the master

It is possible for either of these links to break *without* the master changing.  (Currently,
the scheduler driver will only re-register if the master changes).

If both links break or if just link (1) breaks, the master views the framework as {{inactive}}
and {{disconnected}}.  This means the framework will not receive any more events (such as
offers) from the master until it re-registers.  There is currently no way for the scheduler
to detect a one-way link breakage.

If link (2) breaks, it makes (almost) no difference to the scheduler.  The scheduler usually
uses the link to send messages to the master, but libprocess will create another socket if
the persistent one is not available.

To fix link breakages for (1+2) and (2), the scheduler driver should implement a `::exited`
event handler for the master's {{pid}} and re-register in this case.

See the related issue [TODO] for link (1) breakage.


> Scheduler driver does not detect disconnection with master and reregister.
> --------------------------------------------------------------------------
>
>                 Key: MESOS-5180
>                 URL: https://issues.apache.org/jira/browse/MESOS-5180
>             Project: Mesos
>          Issue Type: Bug
>          Components: scheduler driver
>    Affects Versions: 0.24.0
>            Reporter: Joseph Wu
>            Assignee: Anand Mazumdar
>              Labels: mesosphere
>
> The existing implementation of the scheduler driver does not re-register with the master
> under some network partition cases.
> When a scheduler registers with the master:
> 1) master links to the framework
> 2) framework links to the master
> It is possible for either of these links to break *without* the master changing.  (Currently,
> the scheduler driver re-registers only if the master changes.)
> If both links break or if just link (1) breaks, the master views the framework as {{inactive}}
> and {{disconnected}}.  This means the framework will not receive any more events (such as
> offers) from the master until it re-registers.  There is currently no way for the scheduler
> to detect a one-way link breakage.
> If link (2) breaks, it makes (almost) no difference to the scheduler.  The scheduler
> usually uses the link to send messages to the master, but libprocess will create another socket
> if the persistent one is not available.
> To fix link breakages for (1+2) and (2), the scheduler driver should implement a `::exited`
> event handler for the master's {{pid}} and re-register in this case.
> See the related issue MESOS-5181 for link (1) breakage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
