aurora-dev mailing list archives

From Zameer Manji <zma...@uber.com>
Subject Aurora, Thermos, PID 1, and You
Date Tue, 01 Nov 2016 01:42:14 GMT
Hey,

Recently I have experienced a number of issues in a production environment
with the DockerContainerizer, Aurora and Thermos. Although my experience is
specific to Docker, I believe this applies to anyone using the Mesos
Containerizer with PID isolation. The root cause of these issues is the
interaction between how we launch the executor and the role of PID 1.

The CommandInfo for the ExecutorInfo uses the default `shell` value, which
is `true`[1]. This means that in any PID-isolated container the `sh`
process that launches the executor will become PID 1. Here is an example
`ps` output from vagrant showing this:
````
root@aurora:/# ps auxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       250  0.0  0.0  21928  2124 ?        Ss   01:19   0:00 /bin/bash
root       469  0.0  0.0  19176  1240 ?        R+   01:28   0:00  \_ ps auxf
root         1  0.0  0.0   4328   636 ?        Ss   01:10   0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json --mesos-containerizer
root         5  0.7  1.4 1201128 45604 ?       Sl   01:10   0:08 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json --mesos-containerizer-
root        23  0.1  0.6 115668 20764 ?        S    01:10   0:01  \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermos_js
root        29  0.0  0.5 113476 17936 ?        Ss   01:10   0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
root        34  0.0  0.0  20040  1476 ?        S    01:10   0:00      |   \_ /bin/bash -c      while true; do       echo hello world       sleep 10     done
root       468  0.0  0.0   4228   348 ?        S    01:28   0:00      |       \_ sleep 10
root        31  0.0  0.5 113476 17936 ?        Ss   01:10   0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
root        32  0.0  0.0  20040  1476 ?        S    01:10   0:00          \_ /bin/bash -c      while true; do       echo hello world       sleep 10     done
root       467  0.0  0.0   4228   352 ?        S    01:28   0:00              \_ sleep 10
root        47  0.0  0.0  24116  3052 ?        S    01:10   0:00 python ./daemon.py
````

This means processes that double fork/daemonize will be reparented to `sh`
and not our executor. You can see that the `python ./daemon.py` process has
been reparented to `sh` rather than the executor, so it is outside the scope
of the runners. This has a number of undesirable implications. Perhaps the
most concerning is that processes that end up reparented to PID 1 will not
receive SIGTERM or SIGKILL from thermos; instead they will be killed by the
kernel when thermos decides to exit. If anyone here decides to run
published images that use popular software that double forks (like nginx),
you will never be able to ensure the processes die cleanly.
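
To make the reparenting concrete, here is a minimal Python 3 sketch of a double fork (illustrative only, not Aurora or Thermos code; the function name is made up for this example). The grandchild waits for the intermediate process to exit and then reports its new parent PID. On Linux, unless some ancestor is a subreaper, that new parent is the PID 1 of the PID namespace, not the process that started the double fork.

```python
import os
import time


def double_fork_grandchild_ppid():
    """Double fork, then report who adopts the grandchild.

    Returns (our_pid, middle_pid, adopted_by), where adopted_by is the
    grandchild's parent PID once the intermediate process has exited.
    """
    our_pid = os.getpid()
    read_fd, write_fd = os.pipe()

    middle_pid = os.fork()
    if middle_pid == 0:
        try:
            # Intermediate process: fork the real worker, then exit right away.
            os.close(read_fd)
            intermediate = os.getpid()
            if os.fork() == 0:
                # Grandchild: wait until the intermediate process is gone,
                # then report the PID of whoever adopted us.
                deadline = time.time() + 5.0
                while os.getppid() == intermediate and time.time() < deadline:
                    time.sleep(0.01)
                os.write(write_fd, str(os.getppid()).encode())
        finally:
            # Both the intermediate process and the grandchild end here.
            os._exit(0)

    # Original process: reap the intermediate child, then read the report.
    os.close(write_fd)
    os.waitpid(middle_pid, 0)
    adopted_by = int(os.read(read_fd, 64).decode())
    os.close(read_fd)
    return our_pid, middle_pid, adopted_by


if __name__ == "__main__":
    ours, middle, adopted = double_fork_grandchild_ppid()
    print("we are %d, the grandchild now belongs to %d" % (ours, adopted))
```

Because the original process is no longer the grandchild's parent, it cannot `waitpid` on it and has no natural way to learn that it still exists, which is the situation the runners are in today.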

I've been thinking about this problem for a while and upon advice from
others and my own research I believe the best solution is as follows:
1. We have good reasons for setting `shell=True` when launching the
executor. I'm not comfortable changing this because I'm not sure of all of
the implications if we choose another method.
2. The thermos runners end up forking off the target processes. I think the
runners should be responsible for all of the processes that are created by
the children.
3. We can make the runners responsible for their grandchildren by using
`prctl(2)`[2] and setting the `PR_SET_CHILD_SUBREAPER` attribute for each
runner. This means double forked processes will be reparented to the runner
and not PID 1.
4. On task teardown, we make the runners send SIGTERM and SIGKILL to the
PIDs they recorded and any other children they have.
5. Each runner would need to have a SIGCHLD handler to handle zombie
processes that are reparented to it.
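
Steps 3 to 5 can be sketched roughly as follows. This is a hedged Python 3 illustration for Linux, not the actual Thermos runner code: the function names, the `reaped` dict and the `/proc` scan are assumptions made for the example, and `PR_SET_CHILD_SUBREAPER = 36` is the constant from `<linux/prctl.h>` (Linux 3.4 and later). It only signals direct children, so deeper descendants are handled once they are orphaned and reparented to the subreaper.

```python
import ctypes
import os
import signal
import time

PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>; needs Linux 3.4+

reaped = {}  # pid -> raw wait status of every child we have collected


def become_subreaper():
    """Ask the kernel to reparent orphaned descendants to this process."""
    libc = ctypes.CDLL(None, use_errno=True)
    args = [ctypes.c_ulong(v) for v in (1, 0, 0, 0)]
    if libc.prctl(ctypes.c_int(PR_SET_CHILD_SUBREAPER), *args) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def reap_children(signum=None, frame=None):
    """Collect every exited child without blocking (usable as a SIGCHLD handler)."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children at all
        if pid == 0:
            return  # children exist, but none has exited yet
        reaped[pid] = status


def child_pids():
    """Return the PIDs whose parent is this process, found by scanning /proc."""
    me = os.getpid()
    children = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % entry, errors="replace") as f:
                stat = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The command name is parenthesised and may contain spaces, so split
        # after the last ')': the remaining fields are state, ppid, ...
        fields = stat.rsplit(")", 1)[1].split()
        if int(fields[1]) == me:
            children.append(int(entry))
    return children


def _signal_all(pids, signum):
    for pid in pids:
        try:
            os.kill(pid, signum)
        except ProcessLookupError:
            pass  # already gone


def terminate_children(grace_seconds=5.0):
    """SIGTERM every current child, wait, then SIGKILL whatever is left."""
    _signal_all(child_pids(), signal.SIGTERM)
    deadline = time.time() + grace_seconds
    while time.time() < deadline:
        reap_children()
        if not child_pids():
            return
        time.sleep(0.05)
    _signal_all(child_pids(), signal.SIGKILL)
    while child_pids():
        reap_children()
        time.sleep(0.05)
```

A runner built along these lines would call `become_subreaper()` before forking anything, install the reaper with `signal.signal(signal.SIGCHLD, reap_children)`, and call `terminate_children()` on teardown. One caveat of this sketch: reaping with `waitpid(-1, ...)` also collects children that other code (for example `subprocess`) expects to wait on itself, so a real implementation would have to coordinate the two.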

[1]: https://github.com/apache/aurora/blob/783baaefb9a814ca01fad78181fe3df3de5b34af/src/main/java/org/apache/aurora/scheduler/configuration/executor/ExecutorModule.java#L109-L135
[2]: http://man7.org/linux/man-pages/man2/prctl.2.html

-- 
Zameer Manji
