mesos-issues mailing list archives

From "Avinash Sridharan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-5879) cgroups/net_cls isolator causing agent recovery issues
Date Mon, 25 Jul 2016 17:32:20 GMT

    [ https://issues.apache.org/jira/browse/MESOS-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392348#comment-15392348 ]

Avinash Sridharan commented on MESOS-5879:
------------------------------------------

Hi Silas,
 I discussed this with [~jieyu]. I agree that the net_cls isolator should not descend
into child cgroups looking for net_cls handles; this is definitely a bug that we should fix.
We can use this JIRA to fix that issue.
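
To illustrate the intended behaviour, here is a minimal Python sketch (not the isolator's actual C++ code) that collects net_cls handles from the immediate child cgroups of a root only, without recursing into nested cgroups. The function name `top_level_classids` and the directory layout are assumptions made for the example; the only kernel detail relied on is that each net_cls cgroup exposes a `net_cls.classid` file.

```python
import os


def top_level_classids(root):
    """Return {cgroup name: classid} for the immediate children of `root` only."""
    handles = {}
    with os.scandir(root) as entries:
        for entry in entries:
            # Only direct child directories are treated as container cgroups.
            if not entry.is_dir(follow_symlinks=False):
                continue
            classid_file = os.path.join(entry.path, "net_cls.classid")
            try:
                with open(classid_file) as f:
                    handles[entry.name] = int(f.read().strip(), 0)
            except (FileNotFoundError, ValueError):
                # No classid file, or unparsable contents: skip this cgroup.
                continue
    return handles
```

Because nested cgroups are never visited, a handle set by something else in a child cgroup cannot show up as a duplicate in this scan.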

As far as reuse of net_cls handles by other orchestrators (such as Docker) in a different
hierarchy is concerned, the expectation is that the operator is responsible for partitioning
the handle space between the different orchestrators by specifying the primary handles
and the secondary handle range.
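
As a rough sketch of what that partitioning means: a net_cls classid is a 32-bit value of the form 0xAAAABBBB, where AAAA is the primary (major) handle and BBBB is the secondary (minor) handle. The Python below packs and unpacks such a classid and checks whether two allocations collide. The helper names and the example ranges for Mesos and Docker are hypothetical, chosen only for illustration.

```python
def make_classid(primary, secondary):
    """Pack a 16-bit primary and a 16-bit secondary handle into one classid."""
    if not 0 < primary <= 0xFFFF:
        raise ValueError("primary handle must be a non-zero 16-bit value")
    if not 0 <= secondary <= 0xFFFF:
        raise ValueError("secondary handle must be a 16-bit value")
    return (primary << 16) | secondary


def split_classid(classid):
    """Return the (primary, secondary) handles of a classid."""
    return classid >> 16, classid & 0xFFFF


def ranges_overlap(a, b):
    """Each argument is (primary, first_secondary, last_secondary), inclusive."""
    same_primary = a[0] == b[0]
    return same_primary and a[1] <= b[2] and b[1] <= a[2]


# Hypothetical operator-chosen allocations sharing one primary handle.
mesos_range = (0x0010, 0x0001, 0x7FFF)
docker_range = (0x0010, 0x8000, 0xFFFF)

if __name__ == "__main__":
    print(hex(make_classid(0x0010, 0x0001)))
    print(ranges_overlap(mesos_range, docker_range))
```

With disjoint secondary ranges (or different primary handles), no two orchestrators can hand out the same classid.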

> cgroups/net_cls isolator causing agent recovery issues
> ------------------------------------------------------
>
>                 Key: MESOS-5879
>                 URL: https://issues.apache.org/jira/browse/MESOS-5879
>             Project: Mesos
>          Issue Type: Bug
>          Components: cgroups, isolation, slave
>            Reporter: Silas Snider
>            Assignee: Avinash Sridharan
>
> We run with 'cgroups/net_cls' in our isolator list, and when we restart any agent process
> in a cluster running an experimental custom isolator as well, the agents are unable to recover
> from checkpoint, because net_cls reports that unknown orphan containers have duplicate net_cls
> handles.
> While this is a problem that needs to be solved (probably by fixing our custom isolator),
> it's also a problem that the net_cls isolator fails recovery just for duplicate handles in
> cgroups that it is literally about to unconditionally destroy during recovery. Can this be
> fixed?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
