mesos-issues mailing list archives

From "Benno Evers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-7966) check for maintenance on agent causes fatal error
Date Thu, 31 May 2018 13:23:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496526#comment-16496526 ]

Benno Evers commented on MESOS-7966:
------------------------------------

> I wasn't aware that Marathon had its own reasons for doing dynamic reservations. Do you have any details you can share on why it does or a link to some code?

I was just basing this on log lines like the following, and on the fact that Marathon is the only framework ever mentioned as receiving inverse offers.
{noformat}
I0502 15:00:57.588295 20632 master.cpp:7769] Sending 1 inverse offers to framework 487b53f1-1a44-44b5-bf9f-24790937b51a-0001 (marathon1) at scheduler-e96a9f61-720c-4c0c-9018-60224ab59031@10.65.137.102:40886
{noformat}

Actually, on re-reading the allocator code, it seems it is enough for a framework to be using any resources on the host scheduled for maintenance, not just reserved ones, so the focus on reservations was probably a bit of a red herring. That shouldn't change anything about the underlying race, though.
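
To make the suspected interleaving concrete, here is a minimal, self-contained sketch; the types and function below are illustrative stand-ins for the allocator code around hierarchical.cpp:872, not the actual Mesos source:
{code}
// Sketch of the suspected race, assuming shapes similar to the
// hierarchical allocator: a per-agent struct holding an
// Option<Maintenance>, and an entry point invoked when a framework
// responds to an inverse offer. Names are illustrative, not verbatim.
#include <cstdlib>
#include <iostream>
#include <map>
#include <string>

// Stand-in for stout's Option<T>.
template <typename T>
struct Option
{
  bool some = false;
  T value{};

  bool isSome() const { return some; }
};

struct Maintenance {};

struct Slave
{
  Option<Maintenance> maintenance; // Set while a window is scheduled.
};

std::map<std::string, Slave> slaves;

// Called when a framework accepts or declines an inverse offer.
void updateInverseOffer(const std::string& slaveId)
{
  // The assertion from the crash: it assumes the agent still has
  // maintenance scheduled whenever an inverse-offer response arrives.
  if (!slaves[slaveId].maintenance.isSome()) {
    std::cerr << "Check failed: slaves[slaveId].maintenance.isSome()\n";
    std::abort(); // CHECK() aborts the master process.
  }
  // ... record the framework's unavailability response ...
}

int main()
{
  slaves["S0"].maintenance = {true, {}};

  // 1. The master sends inverse offers for S0 to a framework.
  // 2. Before the framework responds, the maintenance schedule is
  //    updated so that S0 no longer has a window, and the allocator
  //    clears the agent's maintenance state:
  slaves["S0"].maintenance = {};

  // 3. The framework's response to the now-stale inverse offer arrives:
  updateInverseOffer("S0"); // aborts, matching the reported crash
}
{code}
If that is roughly what happens, the fix direction would be for the allocator to tolerate (or for the master to rescind) outstanding inverse offers when a machine is removed from the maintenance schedule.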

> check for maintenance on agent causes fatal error
> -------------------------------------------------
>
>                 Key: MESOS-7966
>                 URL: https://issues.apache.org/jira/browse/MESOS-7966
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.1.0
>            Reporter: Rob Johnson
>            Assignee: Benno Evers
>            Priority: Critical
>              Labels: mesosphere, reliability
>
> We interact with the maintenance API frequently to orchestrate gracefully draining agents of tasks without impacting service availability.
> Occasionally we seem to trigger a fatal error in Mesos when interacting with the API. This happens relatively frequently, and impacts us when downstream frameworks (Marathon) react badly to leader elections.
> Here is the log line that we see when the master dies:
> {code}
> F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: slaves[slaveId].maintenance.isSome()
> {code}
> It's quite possible we're using the maintenance API in the wrong way. We're happy to provide any other logs you need - please let me know what would be useful for debugging.
> Thanks.
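
For context, a drain like the one the reporter describes would typically go through the master's /maintenance/schedule endpoint (documented in the Mesos maintenance primitives guide). A minimal sketch of posting a one-hour window with libcurl; the master address, machine identity, and timestamps below are placeholders:
{code}
// Hedged illustration of scheduling a maintenance window via the
// master's /maintenance/schedule endpoint. Error handling is reduced
// to the bare minimum; build with -lcurl.
#include <curl/curl.h>
#include <iostream>

int main()
{
  curl_global_init(CURL_GLOBAL_DEFAULT);
  CURL* curl = curl_easy_init();
  if (curl == nullptr) {
    return 1;
  }

  // One window draining a single agent; times are nanoseconds since
  // the epoch, matching the documented schedule format. Values here
  // are placeholders.
  const char* schedule = R"({
    "windows": [{
      "machine_ids": [{"hostname": "agent1", "ip": "10.0.0.1"}],
      "unavailability": {
        "start": {"nanoseconds": 1506400000000000000},
        "duration": {"nanoseconds": 3600000000000}
      }
    }]
  })";

  struct curl_slist* headers = nullptr;
  headers = curl_slist_append(headers, "Content-Type: application/json");

  curl_easy_setopt(curl, CURLOPT_URL,
                   "http://master.example:5050/maintenance/schedule");
  curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
  curl_easy_setopt(curl, CURLOPT_POSTFIELDS, schedule);

  CURLcode res = curl_easy_perform(curl);
  if (res != CURLE_OK) {
    std::cerr << "POST failed: " << curl_easy_strerror(res) << "\n";
  }

  curl_slist_free_all(headers);
  curl_easy_cleanup(curl);
  curl_global_cleanup();
  return res == CURLE_OK ? 0 : 1;
}
{code}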



