mesos-issues mailing list archives

From "Qian Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MESOS-8877) Docker container's resources will be wrongly enlarged in cgroups after agent recovery
Date Fri, 11 May 2018 01:28:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-8877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462217#comment-16462217 ]

Qian Zhang edited comment on MESOS-8877 at 5/11/18 1:27 AM:
------------------------------------------------------------

The root cause of this issue is that when we recover a container in `DockerContainerizerProcess::_recover`,
the resources of the container are NOT set (see [here|https://github.com/apache/mesos/blob/1.5.0/src/slave/containerizer/docker.cpp#L1013:L1051] for
details). As a result, when the Docker executor reregisters with the agent, `DockerContainerizerProcess::__update`
is called to update the resources of the Docker container in cgroups, because the result
of [this check|https://github.com/apache/mesos/blob/1.5.0/src/slave/containerizer/docker.cpp#L1652] is
not true. The updated resources include both the Docker container's resources and the
Docker executor's resources (0.1 cpus and 32 MB memory), which is why the Docker container's
resources in cgroups are enlarged by 0.1 cpus and 32 MB memory after agent recovery.

We do not have this issue when launching the Docker container in the first place, because its
resources are set (see [here|https://github.com/apache/mesos/blob/1.5.0/src/slave/containerizer/docker.hpp#L343] for
details) and contain both the Docker container's resources and the Docker executor's
resources, so the result of [this check|https://github.com/apache/mesos/blob/1.5.0/src/slave/containerizer/docker.cpp#L1652] is
true and `DockerContainerizerProcess::__update` is not called.
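
To make the two code paths above easier to compare, here is a minimal Python sketch of the behaviour. It is only a model, not the Mesos C++ code: the names `Resources`, `Container` and `update` are hypothetical, and it assumes the linked check simply compares the resources stored on the container with the resources passed to the update.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Resources:
    """Hypothetical stand-in for a container's cpus/mem resources."""
    cpus: float
    mem_mb: int

    def __add__(self, other: "Resources") -> "Resources":
        return Resources(self.cpus + other.cpus, self.mem_mb + other.mem_mb)


TASK = Resources(0.1, 32)      # from task_docker.json in the issue
EXECUTOR = Resources(0.1, 32)  # Docker executor's resources per the comment


class Container:
    def __init__(self, resources: Optional[Resources]) -> None:
        self.resources = resources  # None models "not set" after recovery
        self.cgroup_updates: List[Resources] = []


def update(container: Container, requested: Resources) -> bool:
    """Return True if cgroups were rewritten, False if the update was skipped.

    Assumption: the check is an equality test between stored and requested
    resources; when it holds, nothing is written to cgroups.
    """
    if container.resources == requested:
        return False
    container.resources = requested
    container.cgroup_updates.append(requested)
    return True


# Launch path: resources are stored up front, so the update is skipped.
launched_updated = update(Container(TASK + EXECUTOR), TASK + EXECUTOR)

# Recovery path: resources are not set, so the check fails and the cgroups
# are rewritten with task + executor resources.
recovered_container = Container(None)
recovered_updated = update(recovered_container, TASK + EXECUTOR)
```

In this model `launched_updated` is false and `recovered_updated` is true, which matches the difference between the first launch and agent recovery described above.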



> Docker container's resources will be wrongly enlarged in cgroups after agent recovery
> -------------------------------------------------------------------------------------
>
>                 Key: MESOS-8877
>                 URL: https://issues.apache.org/jira/browse/MESOS-8877
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker
>            Reporter: Qian Zhang
>            Priority: Major
>
> Reproduce steps:
> 1. Run `mesos-execute --master=10.0.49.2:5050 --task=file:///home/qzhang/workspace/config/task_docker.json --checkpoint=true` to launch a Docker container.
> {code:java}
> # cat task_docker.json 
> {
>   "name": "test",
>   "task_id": {"value" : "test"},
>   "agent_id": {"value" : ""},
>   "resources": [
>     {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
>     {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
>   ],
>   "command": {
>     "value": "sleep 55555"
>   },
>   "container": {
>     "type": "DOCKER",
>     "docker": {
>       "image": "alpine"
>     }
>   }
> }
> {code}
> 2. While the Docker container is running, its resources in cgroups are set correctly; so far so good.
> {code:java}
> # cat /sys/fs/cgroup/cpu,cpuacct/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106/cpu.cfs_quota_us
> 10000
> # cat /sys/fs/cgroup/memory/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106/memory.limit_in_bytes
> 33554432
> {code}
> 3. Restart the Mesos agent; the resources of the Docker container are then wrongly enlarged.
> {code}
> I0503 02:06:17.268340 29512 docker.cpp:1855] Updated 'cpu.shares' to 204 at /sys/fs/cgroup/cpu,cpuacct/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106 for container 1b21295b-2f49-4d08-84c7-43b9ae15ad88
> I0503 02:06:17.271390 29512 docker.cpp:1882] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 20ms (cpus 0.2) for container 1b21295b-2f49-4d08-84c7-43b9ae15ad88
> I0503 02:06:17.273082 29512 docker.cpp:1924] Updated 'memory.soft_limit_in_bytes' to 64MB for container 1b21295b-2f49-4d08-84c7-43b9ae15ad88
> I0503 02:06:17.275908 29512 docker.cpp:1950] Updated 'memory.limit_in_bytes' to 64MB at /sys/fs/cgroup/memory/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106 for container 1b21295b-2f49-4d08-84c7-43b9ae15ad88
> # cat /sys/fs/cgroup/cpu,cpuacct/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106/cpu.cfs_quota_us
> 20000
> # cat /sys/fs/cgroup/memory/docker/a711b3c7b0d91cd6d1c7d8daf45a90ff78d2fd66973e615faca55a717ec6b106/memory.limit_in_bytes
> 67108864
> {code}
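
The cgroup values in the reproduce steps can be cross-checked with a little arithmetic. The Python sketch below is an illustration only: `cgroup_values` is a hypothetical helper, the 100ms CFS period is taken from the agent log, and the factor of 1024 shares per CPU is an assumption that happens to be consistent with the logged 'cpu.shares' of 204 for 0.2 cpus.

```python
CFS_PERIOD_US = 100_000  # 100ms, as reported in the agent log
SHARES_PER_CPU = 1024    # assumption; matches 'cpu.shares' 204 for 0.2 cpus
BYTES_PER_MB = 1024 * 1024


def cgroup_values(cpus: float, mem_mb: int) -> dict:
    """Translate cpus/mem (MB) into the cgroup settings seen in the issue."""
    return {
        # Truncated to an integer, e.g. 0.2 * 1024 = 204.8 becomes 204.
        "cpu.shares": int(cpus * SHARES_PER_CPU),
        "cpu.cfs_quota_us": round(cpus * CFS_PERIOD_US),
        "memory.limit_in_bytes": mem_mb * BYTES_PER_MB,
    }


# Step 2: only the task's resources (0.1 cpus, 32 MB) are applied.
before = cgroup_values(0.1, 32)

# Step 3: task plus Docker executor resources (0.1 cpus, 32 MB each).
after = cgroup_values(0.1 + 0.1, 32 + 32)
```

Under these assumptions `before` gives a quota of 10000 and a memory limit of 33554432, and `after` gives 20000 and 67108864, which are the values read from cgroups before and after the agent restart.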



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
