aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zameer Manji <zma...@apache.org>
Subject Re: Review Request 53418: Send SIGTERM to daemonized processes on shutdown.
Date Fri, 04 Nov 2016 19:44:20 GMT


> On Nov. 4, 2016, 9:03 a.m., Stephan Erb wrote:
> > RELEASE-NOTES.md, line 13
> > <https://reviews.apache.org/r/53418/diff/4/?file=1553877#file1553877line13>
> >
> >     This should be elaborated a little bit more. It is confusing as Thermos processes
have a `daemon` property.

Good catch, I have updated it to have more detail.


> On Nov. 4, 2016, 9:03 a.m., Stephan Erb wrote:
> > src/main/python/apache/thermos/common/process_util.py, lines 50-74
> > <https://reviews.apache.org/r/53418/diff/4/?file=1553878#file1553878line50>
> >
> >     I would propose to wrap that entire method in a try-catch block so that this
method is effectively no-throw. The advantages:
> >     
> >     * Thermos will still work on Kernels older than 3.4 (even though those would
still be affected by the bug you have reported).
> >     * We minimize the risk of crashes due to bugs in this low level code.
> >     * We can provide propper logging here, as we have the necessary context to explain
what is going on and that it is safe to continue.

I've thought about this a lot, but kernel 3.4 was released in 2012: https://kernelnewbies.org/Linux_3.4

If folks are running a kernel older than that and running Mesos, I would be very surprised.
For reference here are the kernels shipping with some very old currently supported distros:
- Ubuntu 12.04 - Kernel 3.2 supported until April 2017
- Ubuntu 14.04 - Kernel 3.13 supported until April 2019
- CentOS/Red Hat 6 - Kernel Version 2.6.32. Released in 2010, supported until May 2017
- CentOS/Red Hat 7 - Kernel Version 3.10. Released in 2014, supported until 2020

Here are the support policies of related projects:
- Mesos (from https://mesos.apache.org/gettingstarted/ ): "For full support of process isolation
under Linux a recent kernel >=3.10 is required."
- Docker (from https://docs.docker.com/engine/installation/binaries/ ): "A 3.10 Linux kernel
is the minimum requirement for Docker."

If you feel very strongly about this, please file a JIRA and we can discuss this in a follow
up review before 0.17.0 is released. I don't strongly oppose this, but it seems like we are
bending backwards to support a case that won't exist.


> On Nov. 4, 2016, 9:03 a.m., Stephan Erb wrote:
> > RELEASE-NOTES.md, line 14
> > <https://reviews.apache.org/r/53418/diff/4/?file=1553877#file1553877line14>
> >
> >     We should try to make that optional. See below for details.

Dropped per my argument below.


> On Nov. 4, 2016, 9:03 a.m., Stephan Erb wrote:
> > src/main/python/apache/thermos/core/process.py, lines 288-297
> > <https://reviews.apache.org/r/53418/diff/4/?file=1553880#file1553880line288>
> >
> >     As mentioned above, I would propose to do the error handling in `setup_child_subreaping`
directly.
> 
> Joshua Cohen wrote:
>     I think this try/except is generally useful. It's something I've been meaning to
add for awhile, as I ran into the scenario frequenly while working on filesystem isolation
where the fork would fail and the task would go lost.

Not doing this now per not making the related change for error handling in `setup_child_subreaping`.

The try/accept is needed to ensure we send up the exepected exception so tasks fail with the
right message and have the right logging.


> On Nov. 4, 2016, 9:03 a.m., Stephan Erb wrote:
> > src/test/sh/org/apache/aurora/e2e/test_daemonizing_process.aurora, line 38
> > <https://reviews.apache.org/r/53418/diff/4/?file=1553882#file1553882line38>
> >
> >     Please add a small comment here for future readers. For example: "Assert that
Thermos does not lose track of double forking processes. On task teardown the daemonized process
should receive a signal to shut down cleanly."

Done, good catch.


- Zameer


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53418/#review154913
-----------------------------------------------------------


On Nov. 3, 2016, 4:48 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53418/
> -----------------------------------------------------------
> 
> (Updated Nov. 3, 2016, 4:48 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Stephan Erb.
> 
> 
> Bugs: AURORA-1808
>     https://issues.apache.org/jira/browse/AURORA-1808
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> # Problem
> 
> Processes can deamonize and escape the supervision of a coordinator. Using the Docker
Containerizer or the Mesos Containerizer with pid isolation means that the processes will
be come reparented to the sh process that launches the executor. For example:
> 
> ```
> root@aurora:/# ps xf
>   PID TTY      STAT   TIME COMMAND
>    48 ?        Ss     0:00 /bin/bash
>    86 ?        R+     0:00  _ ps xf
>     1 ?        Ss     0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble
localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/va
>     5 ?        Sl     0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble
localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vag
>    23 ?        S      0:00  _ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be152 --
>    29 ?        Ss     0:00      _ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15
>    32 ?        S      0:00      |   _ /bin/bash -c      while true; do       echo hello
world       sleep 10     done
>    81 ?        S      0:00      |       _ sleep 10
>    31 ?        Ss     0:00      _ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15
>    33 ?        S      0:00          _ /bin/bash -c      while true; do       echo hello
world       sleep 10     done
>    82 ?        S      0:00              _ sleep 10
>    47 ?        S      0:00 python ./daemon.py
> ```
> 
> # Solution
> 
> Ensure processes that escape the supervision of the coordinator reparent to the runner
who can send signals to them on task tear down. We do this by using the `PR_SET_CHILD_SUBREAPER`
flag of `prctl(2)`.
> 
> After this change the process tree looks like:
> ```
> root@aurora:/# ps xf
>   PID TTY      STAT   TIME COMMAND
>    66 ?        Ss     0:00 /bin/bash
>    70 ?        R+     0:00  _ ps xf
>     1 ?        Ss     0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble
localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/va
>     5 ?        Sl     0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble
localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vag
>    23 ?        S      0:00  _ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b849 --
>    33 ?        Ss     0:00      _ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84
>    40 ?        S      0:00      |   _ /bin/bash -c      while true; do       echo hello
world       sleep 10     done
>    63 ?        S      0:00      |       _ sleep 10
>    36 ?        Ss     0:00      _ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84
>    37 ?        S      0:00      |   _ /bin/bash -c      while true; do       echo hello
world       sleep 10     done
>    62 ?        S      0:00      |       _ sleep 10
>    55 ?        S      0:00      _ python ./daemon.py
> 
> ```
> 
> Now the runner is aware of the reparented procesess can can tear it down cleanly with
a `SIGTERM`.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md d89ef2f641373ac229be693a21a2c0111e1f241a 
>   src/main/python/apache/thermos/common/process_util.py abd2c0ef35858d13971319b0a7436ce2293824ce

>   src/main/python/apache/thermos/core/helper.py 68855e1e54ba1cd4456e18a36fb237ce6a468c34

>   src/main/python/apache/thermos/core/process.py 3ec43e2719ef97026f399c4b2aa23002559b3153

>   src/main/python/apache/thermos/core/runner.py 7b9013d11f6ff4172b6b7bf56e62299b0d11c977

>   src/test/sh/org/apache/aurora/e2e/test_daemonizing_process.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 67702d2c0f2e18ee10dcb798b6d421050bd7d4ca

> 
> Diff: https://reviews.apache.org/r/53418/diff/
> 
> 
> Testing
> -------
> 
> src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message