mesos-issues mailing list archives

From "Karsten (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MESOS-8762) Framework Teardown Leaves Task in Uninterruptible Sleep State D
Date Fri, 20 Apr 2018 14:00:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445755#comment-16445755
] 

Karsten edited comment on MESOS-8762 at 4/20/18 1:59 PM:
---------------------------------------------------------

Take a look at this build: https://jenkins.mesosphere.com/service/jenkins/view/Marathon/job/marathon-sandbox/job/marathon-loop-karsten-slim/142/.
First, check the console output. You'll see two processes in state D:
{code}
root     12440  0.2  0.0  45380 13488 ?        D    02:48   0:00 python src/app_mock.py 33801
resident-pod-16322 2018-04-20T02:48:35.397Z http://www.example.com
root     12601  0.5  0.0  45380 13352 ?        D    02:48   0:00 python src/app_mock.py 33793
resident-pod-16322-fail 2018-04-20T02:48:37.678Z http://www.example.com
{code}
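
(For anyone re-checking a run, here is a small sketch, not part of the test suite, that scans {{/proc}} on the agent host and prints every process in uninterruptible sleep together with its parent PID. It only assumes a Linux {{/proc}} filesystem.)
{code}
# Sketch only: list processes in state D (uninterruptible sleep) with their
# parent PID by reading /proc directly. Not part of the Marathon test suite.
import os

for pid in filter(str.isdigit, os.listdir('/proc')):
    try:
        with open('/proc/%s/stat' % pid) as f:
            # /proc/<pid>/stat: "<pid> (<comm>) <state> <ppid> ..."
            state, ppid = f.read().rsplit(')', 1)[1].split()[:2]
        with open('/proc/%s/cmdline' % pid) as f:
            cmdline = f.read().replace('\0', ' ').strip()
    except (IOError, OSError):
        continue  # the process exited while we were reading
    if state == 'D':
        print('pid=%s ppid=%s cmd=%s' % (pid, ppid, cmdline))
{code}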

Now, download the logs and search for {{Process tree before teardown}}. It lists all processes
before we trigger the framework teardown on Mesos. You'll find PIDs {{12440}} and {{12601}}
as children of {{systemd 1}}, *not* as children of {{mesos-agent,11783}} as I would expect.
The same holds for the {{ps auxf}} tree in the console logs, which is taken *after* all tests
ran.
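
(To double-check the reparenting, walking the ancestor chain of one of those PIDs works too. A quick sketch, again not from the test suite; the PID is just the example from the build above.)
{code}
# Sketch only: print the ancestor chain of a PID so you can see whether it
# hangs off systemd (PID 1) or mesos-agent. 12440 is the example PID from
# the build linked above; adjust as needed.
def ppid_and_comm(pid):
    """Return (ppid, comm) for a PID by parsing /proc/<pid>/stat."""
    with open('/proc/%d/stat' % pid) as f:
        data = f.read()
    comm = data[data.index('(') + 1:data.rindex(')')]
    fields = data[data.rindex(')') + 2:].split()
    return int(fields[1]), comm    # fields[0] is the state, fields[1] the PPID

pid = 12440
while True:
    ppid, comm = ppid_and_comm(pid)
    print('%d (%s)' % (pid, comm))
    if pid == 1:
        break
    pid = ppid
{code}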



> Framework Teardown Leaves Task in Uninterruptible Sleep State D
> ---------------------------------------------------------------
>
>                 Key: MESOS-8762
>                 URL: https://issues.apache.org/jira/browse/MESOS-8762
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Karsten
>            Assignee: Till Toenshoff
>            Priority: Major
>         Attachments: UpgradeIntegrationTest.zip, happy-process-sandbox.zip, happy-process.trace,
master_agent.log, zombi-process.trace, zombie-process-sandbox.zip
>
>
> Marathon has a test suite that starts a simple Python HTTP server in a task group,
aka pod in Marathon, with a persistent volume. After the test run we call {{/teardown}} and
wait for the Marathon framework to be completed (see [MesosTest|https://github.com/mesosphere/marathon/blob/master/src/test/scala/mesosphere/marathon/integration/setup/MesosTest.scala#L311]).
>  
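> (For context, the teardown in question goes against the Mesos master's teardown endpoint. Below is a rough sketch of such a call, not the actual MesosTest code; the master address and framework ID are placeholders.)
> {code}
> # Sketch of a framework teardown call against the Mesos master; not the
> # actual MesosTest code. MASTER and FRAMEWORK_ID are placeholders.
> import urllib.parse
> import urllib.request
> 
> MASTER = 'http://127.0.0.1:5050'      # placeholder master address
> FRAMEWORK_ID = '<framework-id>'       # placeholder framework ID
> 
> data = urllib.parse.urlencode({'frameworkId': FRAMEWORK_ID}).encode()
> resp = urllib.request.urlopen(MASTER + '/master/teardown', data=data)
> print(resp.status, resp.read().decode())
> {code}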
> Our CI checks whether we leak any tasks after all test runs. It turns out we do:
> {code}
> Will kill:
>   root     18084  0.0  0.0  45380 13612 ?        D    07:52   0:00 python src/app_mock.py
35477 resident-pod-16322-fail 2018-04-06T07:52:16.924Z http://www.example.com
> Running 'sudo kill -9 18084
> Wait for processes being killed...
> ...
> Couldn't kill some leaked processes:
>   root     18084  0.0  0.0  45380 13612 ?        D    07:52   0:00 python src/app_mock.py
35477 resident-pod-16322-fail 2018-04-06T07:52:16.924Z http://www.example.com
> ammonite.$file.ci.utils$StageException: Stage Compile and Test failed.
> {code}
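> (Roughly, the leak check does the following; this is a sketch of the idea, not the actual ammonite CI script.)
> {code}
> # Sketch of the leak check quoted above; not the actual CI script.
> # It SIGKILLs leftover app_mock.py processes and reports any survivors.
> # A process stuck in state D survives even SIGKILL, which is the failure here.
> import os, signal, subprocess, time
> 
> def leaked_pids():
>     out = subprocess.run(['pgrep', '-f', 'app_mock.py'],
>                          capture_output=True, text=True).stdout
>     return [int(p) for p in out.split()]
> 
> for pid in leaked_pids():
>     print('Will kill: %d' % pid)
>     os.kill(pid, signal.SIGKILL)
> 
> time.sleep(5)                       # "Wait for processes being killed..."
> survivors = leaked_pids()
> if survivors:
>     raise RuntimeError("Couldn't kill some leaked processes: %s" % survivors)
> {code}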
> The attached Mesos master and agents logs (see attachment) show
> {code}
> Updating the state of task resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2
of framework 3c1bf149-6b68-469c-beb9-9910f386fd5a-0000 (latest state: TASK_KILLED, status
update state: TASK_KILLED)
> {code}
> The executor logs (see zipped sandbox attached) shows
> {code}
> I0406 07:52:39.925599 18078 default_executor.cpp:191] Received SHUTDOWN event
> I0406 07:52:39.925624 18078 default_executor.cpp:962] Shutting down
> I0406 07:52:39.925634 18078 default_executor.cpp:1058] Killing task resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2
running in child container cfce1f46-f565-4f45-aa0f-e6e6f6c5434b.0cc9c336-f0b4-448d-a8b4-dea33c02f0ae
with SIGTERM signal
> I0406 07:52:39.925647 18078 default_executor.cpp:1080] Scheduling escalation to SIGKILL
in 3secs from now
> {code}
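> (The escalation logged here is the usual SIGTERM-then-SIGKILL pattern; a sketch of that logic follows, not the default executor's actual code. The point is that a process in state D reacts to neither signal until its blocking kernel call returns.)
> {code}
> # Sketch of the SIGTERM -> SIGKILL escalation described in the log above;
> # not the default executor's implementation. Assumes `pid` is a child of
> # the calling process and uses the 3s grace period from the log.
> import os, signal, time
> 
> def kill_with_escalation(pid, grace=3.0):
>     os.kill(pid, signal.SIGTERM)
>     deadline = time.time() + grace
>     while time.time() < deadline:
>         if os.waitpid(pid, os.WNOHANG) != (0, 0):   # child has exited
>             return
>         time.sleep(0.1)
>     # Grace period over: escalate. A task in uninterruptible sleep (D)
>     # ignores even SIGKILL until its blocking syscall completes.
>     os.kill(pid, signal.SIGKILL)
> {code}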
> The task logs
> {code}
> 2018-04-06 07:52:36,620 INFO    : 2.7.13
> 2018-04-06 07:52:36,621 DEBUG   : ['src/app_mock.py', '35477', 'resident-pod-16322-fail',
'2018-04-06T07:52:16.924Z', 'http://www.example.com']
> 2018-04-06 07:52:36,621 INFO    : AppMock[resident-pod-16322-fail 2018-04-06T07:52:16.924Z]:
resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 has taken the
stage at port 35477. Will query http://www.example.com for health and readiness status.
> 2018-04-06 07:52:38,895 DEBUG   : Got GET request
> 172.16.10.198 - - [06/Apr/2018 07:52:38] "GET /pst1/foo HTTP/1.1" 200 -
> {code}
> Compare these to a single task without a persistent volume of the same run
> {code}
> I0406 07:52:39.925590 15585 exec.cpp:445] Executor asked to shutdown
> I0406 07:52:39.925698 15585 executor.cpp:171] Received SHUTDOWN event
> I0406 07:52:39.925717 15585 executor.cpp:748] Shutting down
> I0406 07:52:39.925729 15585 executor.cpp:863] Sending SIGTERM to process tree at pid
15591
> I0406 07:52:40.029878 15585 executor.cpp:876] Sent SIGTERM to the following process trees:
> [ 
> -+- 15591 sh -c echo APP PROXY $MESOS_TASK_ID RUNNING; /home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py
$PORT0 /app-156 v1 http://$HOST:0/%2Fapp-156/v1 
>  \--- 15612 python /home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py
35417 /app-156 v1 http://172.16.10.198:0/%2Fapp-156/v1 
> ]
> I0406 07:52:40.029912 15585 executor.cpp:880] Scheduling escalation to SIGKILL in 3secs
from now
> 2018-04-06 07:52:40,029 WARNING : Received 15 signal. Closing the server...
> 2018-04-06 07:52:40,030 ERROR   : Socket.error in the main thread: 
> Traceback (most recent call last):
>   File "/home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py",
line 151, in <module>
>     httpd.serve_forever()
>   File "/usr/lib/python2.7/SocketServer.py", line 231, in serve_forever
>     poll_interval)
>   File "/usr/lib/python2.7/SocketServer.py", line 150, in _eintr_retry
>     return func(*args)
>   File "/usr/lib/python2.7/SocketServer.py", line 456, in fileno
>     return self.socket.fileno()
>   File "/usr/lib/python2.7/socket.py", line 228, in meth
>     return getattr(self._sock,name)(*args)
>   File "/usr/lib/python2.7/socket.py", line 174, in _dummy
>     raise error(EBADF, 'Bad file descriptor')
> error: [Errno 9] Bad file descriptor
> 2018-04-06 07:52:40,030 INFO    : Closing the server...
> {code}
> The server receives the {{SIGTERM}} and shuts down.
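> (That happy path matches the usual graceful-shutdown pattern. Below is a minimal Python sketch of the same idea, not the actual app_mock.py, which appears to close its listening socket from the handler and then rides out the resulting {{EBADF}}: a SIGTERM handler unwinds {{serve_forever()}} and the server is closed on the way out.)
> {code}
> # Minimal sketch of the graceful-shutdown pattern seen in the happy case;
> # not the actual app_mock.py.
> import signal
> import sys
> from http.server import HTTPServer, SimpleHTTPRequestHandler
> 
> httpd = HTTPServer(('', 0), SimpleHTTPRequestHandler)
> 
> def on_sigterm(signum, frame):
>     print('Received %d signal. Closing the server...' % signum)
>     sys.exit(0)              # raises SystemExit in the main thread
> 
> signal.signal(signal.SIGTERM, on_sigterm)
> 
> try:
>     httpd.serve_forever()
> finally:
>     httpd.server_close()
>     print('Closing the server...')
> {code}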



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
