mesos-issues mailing list archives

From "Benjamin Bannier (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (MESOS-8524) When `UPDATE_SLAVE` messages are received, offers might not be rescinded due to a race
Date Thu, 15 Feb 2018 16:09:00 GMT

     [ https://issues.apache.org/jira/browse/MESOS-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Bannier reassigned MESOS-8524:
---------------------------------------

    Assignee:     (was: Benjamin Bannier)

> When `UPDATE_SLAVE` messages are received, offers might not be rescinded due to a race
> ---------------------------------------------------------------------------------------
>
>                 Key: MESOS-8524
>                 URL: https://issues.apache.org/jira/browse/MESOS-8524
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>    Affects Versions: 1.5.0
>         Environment: Master + Agent running with enabled {{RESOURCE_PROVIDER}} capability
>            Reporter: Jan Schlicht
>            Priority: Major
>              Labels: mesosphere
>
> When an agent with the {{RESOURCE_PROVIDER}} capability enabled (re-)registers with the master,
it sends an {{UPDATE_SLAVE}} message after being (re-)registered. In the master, the agent is added
(back) to the allocator as soon as it is (re-)registered, i.e. before {{UPDATE_SLAVE}} is
sent. This triggers an allocation, and offers might get sent out to frameworks. When
{{UPDATE_SLAVE}} is handled in the master, these offers have to be rescinded, as they are
based on an outdated agent state.
> Internally, the allocator defers an offer callback in the master ({{Master::offer}}).
In rare cases an {{UPDATE_SLAVE}} message might arrive at the same time, and its handler in
the master is called before the offer callback (but after the actual allocation took place).
In this case the (outdated) offer is still sent to frameworks and never rescinded.
> Here are the relevant log lines; this was discovered while working on https://reviews.apache.org/r/65045/:
> {noformat}
> I0201 14:17:47.041093 242208768 hierarchical.cpp:1517] Performed allocation for 1 agents in 704915ns
> I0201 14:17:47.041738 242745344 master.cpp:7235] Received update of agent 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 at slave(540)@172.18.8.20:60469 (172.18.8.20) with total oversubscribed resources {}
> I0201 14:17:47.042778 242745344 master.cpp:8808] Sending 1 offers to framework 53c557e7-3161-449b-bacc-a4f8c02e78e7-0000 (default) at scheduler-798f476b-b099-443e-bd3b-9e7333f29672@172.18.8.20:60469
> I0201 14:17:47.043102 243281920 sched.cpp:921] Scheduler::resourceOffers took 40444ns
> I0201 14:17:47.043427 243818496 hierarchical.cpp:712] Grew agent 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 by disk[MOUNT]:200 (total), {  } (used)
> I0201 14:17:47.043643 243818496 hierarchical.cpp:669] Agent 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 (172.18.8.20) updated with total resources disk[MOUNT]:200; cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> {noformat}
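
The ordering described above can be reproduced in a small toy model. This is only a sketch, not Mesos code: ToyMaster, its mailbox, the agentVersion counter, and the checkVersion guard are all hypothetical names made up for illustration. The guard is one conceivable way to detect an outdated offer and is not claimed to be the fix adopted for this issue. The model assumes the master handles events one at a time and that the allocation result is computed before the offer callback is enqueued.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Toy model only: ToyMaster and its members are hypothetical, not Mesos code.
struct ToyMaster {
  using Event = std::function<void()>;

  std::deque<Event> mailbox;           // events are handled one at a time, in order
  int agentVersion = 0;                // bumped by every UPDATE_SLAVE
  std::vector<int> outstandingOffers;  // agent version each sent offer was based on
  std::size_t rescinded = 0;           // offers rescinded by UPDATE_SLAVE handling
  std::size_t dropped = 0;             // offers discarded by the hypothetical guard
  bool checkVersion = false;           // hypothetical guard, off by default

  // The allocation happens now, against the current agent state;
  // sending the offer is deferred and returned as an event.
  Event allocate() {
    const int basedOn = agentVersion;
    return [this, basedOn] { offer(basedOn); };
  }

  // Stand-in for the deferred offer callback.
  void offer(int basedOn) {
    if (checkVersion && basedOn != agentVersion) {
      ++dropped;  // offer is based on an outdated agent state
      return;
    }
    outstandingOffers.push_back(basedOn);
  }

  // Stand-in for the UPDATE_SLAVE handler: rescind what is outstanding,
  // then update the agent state.
  void updateSlave() {
    rescinded += outstandingOffers.size();
    outstandingOffers.clear();
    ++agentVersion;
  }

  // Drain the mailbox in FIFO order.
  void run() {
    while (!mailbox.empty()) {
      Event event = std::move(mailbox.front());
      mailbox.pop_front();
      event();
    }
  }

  // Offers that were sent but are based on an older agent state.
  std::size_t staleOffers() const {
    return static_cast<std::size_t>(std::count_if(
        outstandingOffers.begin(), outstandingOffers.end(),
        [this](int basedOn) { return basedOn != agentVersion; }));
  }
};

// Replays the racy ordering: allocation, then the UPDATE_SLAVE handler,
// then the deferred offer callback.
std::size_t staleOffersAfterRace(bool checkVersion) {
  ToyMaster master;
  master.checkVersion = checkVersion;

  ToyMaster::Event deferredOffer = master.allocate();
  master.mailbox.push_back([&master] { master.updateSlave(); });
  master.mailbox.push_back(std::move(deferredOffer));
  master.run();

  return master.staleOffers();
}
```

Calling staleOffersAfterRace(false) leaves one offer outstanding that was never rescinded, matching the log sequence above: the update handler runs while no offer exists yet, and the offer goes out afterwards. With the hypothetical guard enabled, the deferred callback notices the version mismatch and drops the offer instead.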



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
