mesos-issues mailing list archives

From "Adam B (JIRA)" <>
Subject [jira] [Updated] (MESOS-5352) Docker volume isolator cleanup can be blocked by first cleanup failure.
Date Thu, 19 Jan 2017 02:43:26 GMT


Adam B updated MESOS-5352:
    Target Version/s: 1.2.0
            Priority: Critical  (was: Major)

> Docker volume isolator cleanup can be blocked by first cleanup failure.
> -----------------------------------------------------------------------
>                 Key: MESOS-5352
>                 URL:
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Gilbert Song
>            Priority: Critical
>              Labels: containerizer
> The summary title may be confusing; please see the description below for details.
> Some background:
> 1). In docker volume isolator cleanup, we currently do reference counting for docker
volumes. The volume driver `unmount` is called only if the ref count is 1.
> 2). We maintain a hash map `infos` to track docker volume mount information for
each containerId. A containerId is erased from the hash map only if all driver
`unmount` calls succeed (each subprocess returns a ready future).
> The issue in this JIRA arises when a slave keeps running (it is not shut down or rebooted)
while frameworks that use docker volumes keep being launched. Once any docker
volume isolator cleanup returns a failure, all later `unmount` calls for those volumes
are blocked by the reference count: `_cleanup()` returned a failure, so the containerId
is not erased from the hash map `infos`, even though all of its volumes may have been
unmounted/detached correctly. (The docker volume isolator calls the driver unmount as a
subprocess, and the driver may return a failure message even if all volumes were
unmounted/detached correctly.) The stale containerId in `infos` then makes every other
isolator cleanup count one extra reference to the volume, so it declines to call the
driver unmount. As a result, after all tasks finish, the docker volumes from the first
failure are still in the `attached` status.
> This issue goes away after the slave recovers, but we cannot rely on restarting the
slave every time we hit this case.
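
The failure mode described above can be reproduced with a small toy model. This is only a sketch in Python, not the Mesos C++ code: the class names `FlakyDriver` and `VolumeIsolatorModel`, the `prepare` and `ref_count` helpers, and the function `run_scenario` are all hypothetical. Only `infos`, the ref-count-of-1 rule, and the "erase only if every `unmount` succeeded" rule come from the description. The driver is assumed to report a failure on its first `unmount` call even though it actually detaches the volume.

```python
class FlakyDriver:
    """Hypothetical volume driver: the first unmount reports failure
    even though the volume is actually detached."""

    def __init__(self):
        self.attached = set()
        self.unmount_calls = 0

    def mount(self, volume):
        self.attached.add(volume)

    def unmount(self, volume):
        self.unmount_calls += 1
        self.attached.discard(volume)      # the detach really happens
        return self.unmount_calls > 1      # but call #1 reports failure


class VolumeIsolatorModel:
    """Toy model of the cleanup logic described in the report."""

    def __init__(self, driver):
        self.driver = driver
        self.infos = {}                    # containerId -> set of volume names

    def prepare(self, container_id, volumes):
        self.infos[container_id] = set(volumes)
        for volume in volumes:
            self.driver.mount(volume)

    def ref_count(self, volume):
        # Number of tracked containers that reference this volume.
        return sum(1 for vols in self.infos.values() if volume in vols)

    def cleanup(self, container_id):
        results = []
        for volume in sorted(self.infos[container_id]):
            # unmount is only attempted by the last user of the volume
            if self.ref_count(volume) == 1:
                results.append(self.driver.unmount(volume))
        if all(results):
            # erase the containerId only if every unmount succeeded
            del self.infos[container_id]
            return True
        return False                       # containerId stays in infos


def run_scenario():
    driver = FlakyDriver()
    isolator = VolumeIsolatorModel(driver)

    isolator.prepare("c1", ["v1"])
    first = isolator.cleanup("c1")         # driver reports a spurious failure

    isolator.prepare("c2", ["v1"])         # a later task reuses the volume
    second = isolator.cleanup("c2")        # ref count is 2, unmount is skipped

    return first, second, driver, isolator
```

In this model, the first cleanup fails and leaves `c1` in `infos`. The second cleanup then sees a reference count of 2 for `v1`, never calls `unmount`, and still reports success, so `v1` stays attached after both containers are gone.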

This message was sent by Atlassian JIRA