mesos-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Qian Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
Date Fri, 24 Aug 2018 07:06:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591235#comment-16591235
] 

Qian Zhang edited comment on MESOS-8568 at 8/24/18 7:05 AM:
------------------------------------------------------------

I ran the exactly same reproduce steps with the above patch applied, and found this issue
was gone, there is only one check container's sandbox directory at any time.
{code:java}
$ ls -la /home/qzhang/opt/mesos/slaves/1eada535-3848-4c76-b8c5-0e9e0d6fa102-S0/frameworks/8a842ab3-8aba-4d64-a744-ae98bdcf6d59-0000/executors/default-executor/runs/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/containers/06e7c625-596c-454c-b092-f17a81073349/containers
| grep check | wc -l
{code}
Here is the agent log, we can see `WAIT_NESTED_CONTAINER` was called before `REMOVE_NESTED_CONTAINER`
was called.
{code:java}
I0823 19:46:39.269901 32604 http.cpp:3366] Processing LAUNCH_NESTED_CONTAINER_SESSION call
for container '9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.277669 32603 switchboard.cpp:316] Container logger module finished preparing
container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18;
IOSwitchboard server is required
I0823 19:46:39.284180 32603 systemd.cpp:98] Assigned child process '34701' to 'mesos_executors.slice'
I0823 19:46:39.284451 32603 switchboard.cpp:604] Created I/O switchboard server (pid: 34701)
listening on socket file '/tmp/mesos-io-switchboard-12e8e4c7-268e-4184-881c-a16b61fa260c'
for container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.288053 32641 linux_launcher.cpp:492] Launching nested container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
and cloning with namespaces 
W0823 19:46:39.302271 32636 http.cpp:2635] Failed to launch container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18:
Collect failed: ==========Fake error==========
I0823 19:46:39.304822 32639 linux_launcher.cpp:580] Asked to destroy container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.305047 32639 linux_launcher.cpp:622] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.306437 32646 cgroups.cpp:2838] Freezing cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.307015 32614 cgroups.cpp:1229] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
after 419840ns
I0823 19:46:39.307715 32641 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 10.0.49.2:42086
I0823 19:46:39.308198 32646 cgroups.cpp:2856] Thawing cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.308298 32641 http.cpp:2685] Processing WAIT_NESTED_CONTAINER call for container
'9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.308583 32605 cgroups.cpp:1258] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
after 265728ns
I0823 19:46:39.373747 32616 linux_launcher.cpp:654] Destroying cgroup '/sys/fs/cgroup/systemd/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:44.375650 32647 switchboard.cpp:807] Sending SIGTERM to I/O switchboard server
(pid: 34701) since container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
is being destroyed
I0823 19:46:44.403535 32637 switchboard.cpp:913] I/O switchboard server process for container
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
has terminated (status=0)
I0823 19:46:47.420578 32622 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 10.0.49.2:42088
I0823 19:46:47.421331 32622 http.cpp:2971] Processing REMOVE_NESTED_CONTAINER call for container
'9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:47.427382 32636 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 10.0.49.2:42090
I0823 19:46:47.428035 32636 http.cpp:3366] Processing LAUNCH_NESTED_CONTAINER_SESSION call
for container '9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-f9d26e4a-aafd-427c-ac96-f4ddf050a13e'
{code}
Here is the default executor's stderr, we do not see any failed `REMOVE_NESTED_CONTAINER`
calls.
{code:java}
Marked '/' as rslave
I0823 19:46:34.180434 34636 executor.cpp:201] Version: 1.8.0
I0823 19:46:34.205943 34658 default_executor.cpp:204] Received SUBSCRIBED event
I0823 19:46:34.207974 34658 default_executor.cpp:208] Subscribed executor on core-dev
I0823 19:46:34.208364 34658 default_executor.cpp:204] Received LAUNCH_GROUP event
I0823 19:46:34.209259 34666 default_executor.cpp:428] Setting 'MESOS_CONTAINER_IP' to: 10.0.49.2
I0823 19:46:34.222025 34633 default_executor.cpp:204] Received ACKNOWLEDGED event
I0823 19:46:34.263268 34632 default_executor.cpp:663] Finished launching tasks [ test ] in
child containers [ 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349
]
I0823 19:46:34.263308 34632 default_executor.cpp:687] Waiting on child containers of tasks
[ test ]
I0823 19:46:34.264135 34642 default_executor.cpp:748] Waiting for child container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349
of task 'test'
I0823 19:46:34.309953 34657 default_executor.cpp:204] Received ACKNOWLEDGED event
W0823 19:46:39.304706 34669 checker_process.cpp:794] Received '400 Bad Request' (Collect failed:
==========Fake error==========) while launching COMMAND check for task 'test'
I0823 19:46:44.415793 34625 checker_process.cpp:457] COMMAND check for task 'test' is not
available
W0823 19:46:47.460309 34624 checker_process.cpp:794] Received '400 Bad Request' (Collect failed:
==========Fake error==========) while launching COMMAND check for task 'test'
I0823 19:46:52.663522 34633 checker_process.cpp:457] COMMAND check for task 'test' is not
available{code}
 


was (Author: qianzhang):
I ran the exactly same reproduce steps with the above patch applied, and found this issue
was gone, there is only one check container's sandbox directory at any time.
{code:java}
$ ls -la /home/qzhang/opt/mesos/slaves/1eada535-3848-4c76-b8c5-0e9e0d6fa102-S0/frameworks/8a842ab3-8aba-4d64-a744-ae98bdcf6d59-0000/executors/default-executor/runs/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/containers/06e7c625-596c-454c-b092-f17a81073349/containers
| grep check | wc -l
{code}
Here is the agent log, we can see `WAIT_NESTED_CONTAINER` was called before `REMOVE_NESTED_CONTAINER`
was called.

 
{code:java}
I0823 19:46:39.269901 32604 http.cpp:3366] Processing LAUNCH_NESTED_CONTAINER_SESSION call
for container '9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.277669 32603 switchboard.cpp:316] Container logger module finished preparing
container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18;
IOSwitchboard server is required
I0823 19:46:39.284180 32603 systemd.cpp:98] Assigned child process '34701' to 'mesos_executors.slice'
I0823 19:46:39.284451 32603 switchboard.cpp:604] Created I/O switchboard server (pid: 34701)
listening on socket file '/tmp/mesos-io-switchboard-12e8e4c7-268e-4184-881c-a16b61fa260c'
for container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.288053 32641 linux_launcher.cpp:492] Launching nested container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
and cloning with namespaces 
W0823 19:46:39.302271 32636 http.cpp:2635] Failed to launch container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18:
Collect failed: ==========Fake error==========
I0823 19:46:39.304822 32639 linux_launcher.cpp:580] Asked to destroy container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.305047 32639 linux_launcher.cpp:622] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.306437 32646 cgroups.cpp:2838] Freezing cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.307015 32614 cgroups.cpp:1229] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
after 419840ns
I0823 19:46:39.307715 32641 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 10.0.49.2:42086
I0823 19:46:39.308198 32646 cgroups.cpp:2856] Thawing cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.308298 32641 http.cpp:2685] Processing WAIT_NESTED_CONTAINER call for container
'9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.308583 32605 cgroups.cpp:1258] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
after 265728ns
I0823 19:46:39.373747 32616 linux_launcher.cpp:654] Destroying cgroup '/sys/fs/cgroup/systemd/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:44.375650 32647 switchboard.cpp:807] Sending SIGTERM to I/O switchboard server
(pid: 34701) since container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
is being destroyed
I0823 19:46:44.403535 32637 switchboard.cpp:913] I/O switchboard server process for container
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
has terminated (status=0)
I0823 19:46:47.420578 32622 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 10.0.49.2:42088
I0823 19:46:47.421331 32622 http.cpp:2971] Processing REMOVE_NESTED_CONTAINER call for container
'9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:47.427382 32636 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 10.0.49.2:42090
I0823 19:46:47.428035 32636 http.cpp:3366] Processing LAUNCH_NESTED_CONTAINER_SESSION call
for container '9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-f9d26e4a-aafd-427c-ac96-f4ddf050a13e'
{code}
Here is the default executor's stderr, we do not see any failed `REMOVE_NESTED_CONTAINER`
calls.
{code:java}
Marked '/' as rslave
I0823 19:46:34.180434 34636 executor.cpp:201] Version: 1.8.0
I0823 19:46:34.205943 34658 default_executor.cpp:204] Received SUBSCRIBED event
I0823 19:46:34.207974 34658 default_executor.cpp:208] Subscribed executor on core-dev
I0823 19:46:34.208364 34658 default_executor.cpp:204] Received LAUNCH_GROUP event
I0823 19:46:34.209259 34666 default_executor.cpp:428] Setting 'MESOS_CONTAINER_IP' to: 10.0.49.2
I0823 19:46:34.222025 34633 default_executor.cpp:204] Received ACKNOWLEDGED event
I0823 19:46:34.263268 34632 default_executor.cpp:663] Finished launching tasks [ test ] in
child containers [ 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349
]
I0823 19:46:34.263308 34632 default_executor.cpp:687] Waiting on child containers of tasks
[ test ]
I0823 19:46:34.264135 34642 default_executor.cpp:748] Waiting for child container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349
of task 'test'
I0823 19:46:34.309953 34657 default_executor.cpp:204] Received ACKNOWLEDGED event
W0823 19:46:39.304706 34669 checker_process.cpp:794] Received '400 Bad Request' (Collect failed:
==========Fake error==========) while launching COMMAND check for task 'test'
I0823 19:46:44.415793 34625 checker_process.cpp:457] COMMAND check for task 'test' is not
available
W0823 19:46:47.460309 34624 checker_process.cpp:794] Received '400 Bad Request' (Collect failed:
==========Fake error==========) while launching COMMAND check for task 'test'
I0823 19:46:52.663522 34633 checker_process.cpp:457] COMMAND check for task 'test' is not
available{code}
 

> Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
> ------------------------------------------------------------------------------------------
>
>                 Key: MESOS-8568
>                 URL: https://issues.apache.org/jira/browse/MESOS-8568
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Andrei Budnik
>            Assignee: Qian Zhang
>            Priority: Blocker
>              Labels: default-executor, health-check, mesosphere
>
> After successful launch of a nested container via `LAUNCH_NESTED_CONTAINER_SESSION`
in a checker library, it calls [waitNestedContainer |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] for
the container. Checker library [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
`REMOVE_NESTED_CONTAINER` to remove a previous nested container before launching a nested
container for a subsequent check. Hence, `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER`
to ensure that the nested container has been terminated and can be removed/cleaned up.
> In case of failure, the library [doesn't call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
`WAIT_NESTED_CONTAINER`. Despite the failure, the container might be launched and the following attempt
to remove the container without call `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal Server Error'
(Nested container has not terminated yet) while removing the nested container '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
used for the COMMAND check for task 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message