mesos-issues mailing list archives

From "Gilbert Song (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-8489) LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky
Date Wed, 28 Mar 2018 19:42:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417999#comment-16417999
] 

Gilbert Song commented on MESOS-8489:
-------------------------------------

[~abudnik], thanks for triaging. However, I don't think we understand this issue deeply
enough yet:
# The race description does not seem accurate to me. The race is between the destruction
of the first cluster::slave and the orphan container destroy in the second slave's recovery
path. We should reset the Owned pointer before we call the next StartSlave(). (This would
fix the flakiness in this unit test.)
# We need to understand why the nested *test* cgroup is still there when we create the first
slave, since removing it is just a simple os::rmdir(). This is the trigger of the flakiness. The *test*
cgroup is supposed to be created and then removed immediately. There might be a bug in cgroup::remove().
https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L485
# The nested *test* cgroup may no longer be needed, since it was a workaround for old kernel
versions. Could you investigate whether this is supported by kernel versions later
than 2.6? If so, we may be able to remove this code and document it (we still need to understand #2,
though). https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L461~#L488
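
To illustrate point 1, here is a minimal, single-threaded sketch of why the order of reset and restart matters. It is not Mesos code: {{Hierarchy}}, {{FakeSlave}}, {{orphansWithoutReset()}} and {{orphansWithReset()}} are hypothetical names, and {{std::unique_ptr}} stands in for the Owned pointer. It only models the overlap deterministically; the real failure is a timing-dependent race.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for the cgroup hierarchy shared by both agents.
struct Hierarchy {
  std::set<std::string> cgroups;
};

// Hypothetical stand-in for an agent: it registers one cgroup when it is
// constructed and removes that cgroup again when it is destroyed.
class FakeSlave {
public:
  FakeSlave(Hierarchy* hierarchy, std::string name)
    : hierarchy_(hierarchy), name_(std::move(name)) {
    assert(hierarchy_ != nullptr);
    hierarchy_->cgroups.insert(name_);
  }

  ~FakeSlave() { hierarchy_->cgroups.erase(name_); }

  // Everything in the hierarchy that this agent did not create itself is
  // what its recovery would treat as an orphan.
  std::set<std::string> orphans() const {
    std::set<std::string> result = hierarchy_->cgroups;
    result.erase(name_);
    return result;
  }

private:
  Hierarchy* hierarchy_;
  std::string name_;
};

// The first agent is still alive while the second one starts, so the
// second agent sees the first agent's cgroup as an orphan.
std::size_t orphansWithoutReset() {
  Hierarchy hierarchy;
  std::unique_ptr<FakeSlave> slave(new FakeSlave(&hierarchy, "first"));
  std::unique_ptr<FakeSlave> next(new FakeSlave(&hierarchy, "second"));
  std::size_t seen = next->orphans().size();
  slave = std::move(next);  // The first agent is destroyed only here.
  return seen;
}

// The first agent is destroyed before the second one starts, so the
// second agent finds nothing left over.
std::size_t orphansWithReset() {
  Hierarchy hierarchy;
  std::unique_ptr<FakeSlave> slave(new FakeSlave(&hierarchy, "first"));
  slave.reset();  // Tear down the first agent first.
  slave.reset(new FakeSlave(&hierarchy, "second"));
  return slave->orphans().size();
}
```

In this model, {{orphansWithoutReset()}} reports one leftover entry and {{orphansWithReset()}} reports none, which is the ordering the suggested test fix relies on.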

> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky
> --------------------------------------------------------------
>
>                 Key: MESOS-8489
>                 URL: https://issues.apache.org/jira/browse/MESOS-8489
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Andrei Budnik
>            Assignee: Andrei Budnik
>            Priority: Major
>              Labels: containerizer, flaky-test, mesosphere
>         Attachments: ROOT_IsolatorFlags-badrun3.txt
>
>
> Observed this on internal Mesosphere CI.
> {code:java}
> ../../src/tests/cluster.cpp:662: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { test }
> {code}
> h2. Steps to reproduce
>  # Add {{::sleep(1);}} before [removing|https://github.com/apache/mesos/blob/e91ce42ed56c5ab65220fbba740a8a50c7f835ae/src/linux/cgroups.cpp#L483] "test" cgroup
>  # recompile
>  # run `GLOG_v=2 sudo GLOG_v=2 ./src/mesos-tests --gtest_filter=LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags --gtest_break_on_failure --gtest_repeat=10 --verbose`
> h2. Race description
> While recovery is in progress for [the first slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733], calling [`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738] leads to calling [`slave::Containerizer::create()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431] to create a containerizer. An attempt to create a mesos c'zer leads to calling [`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124]. Finally, we get to the point where we try to create a ["test" container|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476]. So the recovery process for the first slave [might detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301] this "test" container as an orphaned container.
> Thus, there is a race between the recovery process for the first slave and an attempt to create a c'zer for the second agent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
