mesos-issues mailing list archives

From "Andrei Budnik (JIRA)" <>
Subject [jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.
Date Mon, 30 Oct 2017 18:37:00 GMT


Andrei Budnik commented on MESOS-7506:

Some tests (from {{SlaveTest}} and {{SlaveRecoveryTest}}) follow a pattern [like this|]:
the clock is advanced by {{executor_registration_timeout}}, and then the test loops, advancing
the clock, until a task status update is sent. This loop runs while the container is being destroyed.
Container destruction itself consists of multiple steps, one of which waits for
[cgroups destruction|].
That means we have a race between the container destruction process and the loop that advances
the clock, with two possible outcomes:
# The container is completely destroyed before the advancing clock reaches a timeout (e.g. {{cgroups::DESTROY_TIMEOUT}}).
# A timeout is triggered by the advancing clock before container destruction completes. That results
in [leaving orphaned|]
containers that are detected by the [Slave destructor|]
in {{tests/cluster.cpp}}, so the test fails.

The issue is easily reproduced by advancing the clock by 60 seconds or more in the loop
that waits for a status update.
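
The race described above can be illustrated with a small standalone model. This is only a sketch, not Mesos code: {{Outcome}}, {{simulate}}, and the parameters {{step}}, {{destroyIterations}}, and {{destroyTimeout}} are hypothetical names, and the 60-second timeout used in the usage note is an assumption taken from the reproduction note above. The model assumes that destruction needs a fixed number of loop iterations of real work, while the loop moves the simulated clock forward by {{step}} on every iteration.

```cpp
#include <cassert>
#include <chrono>

using Seconds = std::chrono::seconds;

// Hypothetical result type for this sketch; not a Mesos type.
struct Outcome {
  bool destroyed;   // Destruction finished before any timeout fired.
  bool orphaned;    // The destroy timeout fired first, leaving an orphan.
  int iterations;   // Loop iterations executed before the race was decided.
};

// Models a test loop that advances a paused clock by `step` per iteration
// while container destruction is still making "real" progress.
//   destroyIterations: iterations of real work destruction needs (assumed).
//   destroyTimeout:    simulated-time limit on destruction (assumed).
Outcome simulate(Seconds step, int destroyIterations, Seconds destroyTimeout)
{
  assert(step.count() > 0 && destroyIterations > 0);

  Seconds elapsed{0};  // Simulated time since destruction started.

  for (int i = 1;; ++i) {
    // Destruction completes one unit of real work per iteration.
    if (i >= destroyIterations) {
      return {true, false, i};  // Outcome 1: container fully destroyed.
    }

    // The test loop then advances the simulated clock.
    elapsed += step;
    if (elapsed >= destroyTimeout) {
      return {false, true, i};  // Outcome 2: timeout fired, orphan left.
    }
  }
}
```

With these assumptions, {{simulate(Seconds(1), 3, Seconds(60))}} ends with the container destroyed on the third iteration, whereas {{simulate(Seconds(60), 3, Seconds(60))}} hits the timeout on the first iteration and reports an orphan, which matches the reproduction above.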

> Multiple tests leave orphan containers.
> ---------------------------------------
>                 Key: MESOS-7506
>                 URL:
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>         Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>            Reporter: Alexander Rukletsov
>            Assignee: Andrei Budnik
>              Labels: containerizer, flaky-test, mesosphere
> I've observed a number of flaky tests that leave orphan containers upon cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}

This message was sent by Atlassian JIRA
