aurora-reviews mailing list archives

From Reza Motamedi <reza.motam...@gmail.com>
Subject Re: Review Request 60748: Prototype using cgroups for monitoring Thermos Process resource consumption (CPU and memory)
Date Tue, 11 Jul 2017 06:47:42 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60748/
-----------------------------------------------------------

(Updated July 11, 2017, 6:47 a.m.)


Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer
Manji.


Changes
-------

Addressing lint errors...


Repository: aurora


Description (updated)
-------

# Prototype using cgroups for monitoring Thermos Process resource consumption (CPU and memory)
The idea behind this prototype is to use kernel cgroups instead of per-PID monitoring of Thermos
Tasks and Processes.
This [document](https://docs.google.com/a/twitter.com/document/d/16JFIqY2ftvNNXxYf6jQwO6EXPajCKp7kPJRAQSsaPko/edit?usp=sharing)
describes the problem that this prototype tries to solve in more detail.
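
To illustrate the idea, here is a minimal sketch (not the code in this patch) of reading aggregate usage for a whole cgroup instead of walking each PID. It assumes the cgroup v1 layout, where `cpuacct.usage` holds cumulative CPU time in nanoseconds and `memory.usage_in_bytes` holds the memory currently charged to the cgroup. The function names and the directory arguments are hypothetical.

```python
import os


def read_cgroup_usage(cpuacct_dir, memory_dir):
    """Return (cpu_seconds, memory_bytes) for one cgroup (cgroup v1 layout assumed)."""
    # cpuacct.usage: cumulative CPU time of all tasks in the cgroup, in nanoseconds.
    with open(os.path.join(cpuacct_dir, 'cpuacct.usage')) as f:
        cpu_seconds = int(f.read().strip()) / 1e9
    # memory.usage_in_bytes: memory currently charged to the cgroup.
    with open(os.path.join(memory_dir, 'memory.usage_in_bytes')) as f:
        memory_bytes = int(f.read().strip())
    return cpu_seconds, memory_bytes


def cpu_rate(prev_cpu_seconds, cur_cpu_seconds, elapsed_seconds):
    """Average number of cores used between two samples of cumulative CPU time."""
    return (cur_cpu_seconds - prev_cpu_seconds) / elapsed_seconds
```

Two file reads per cgroup replace one set of reads per PID, which is where any saving in monitoring overhead would come from.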

__Note:__ This prototype piggybacks on the cgroup clean-up implemented in Mesos. If Mesos's
memory and CPU isolation are not enabled, it does not create cgroups and simply falls back
to the old monitoring scheme.
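
As a rough sketch of that fallback decision (how the patch actually detects isolation is not shown in this message), one could inspect the value of the Mesos agent's `--isolation` flag for the `cgroups/cpu` and `cgroups/mem` isolators. The function name is hypothetical.

```python
def cgroup_monitoring_available(isolation_flag):
    """True if both the CPU and memory cgroup isolators are listed.

    `isolation_flag` is the comma-separated value of the Mesos agent's
    --isolation flag, e.g. 'cgroups/cpu,cgroups/mem'.
    """
    isolators = {name.strip() for name in isolation_flag.split(',')}
    return {'cgroups/cpu', 'cgroups/mem'} <= isolators
```

When this returns `False`, the monitor would keep using the existing per-PID collection.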

__Important Compatibility:__ I also found that this kind of memory monitoring only works
when the `memory.use_hierarchy` flag is enabled. In my Vagrant environment, at least, this
does not seem to be the case, so some support on the Mesos side is needed first.
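
With cgroup v1, a parent memory cgroup only accounts for its children's usage when `memory.use_hierarchy` is `1`. A minimal check might look like the following; the function name is hypothetical, and the directory passed in depends on where the memory controller is mounted on the host.

```python
import os


def uses_hierarchy(memory_cgroup_dir):
    """True if memory.use_hierarchy is set to 1 for the given cgroup directory."""
    path = os.path.join(memory_cgroup_dir, 'memory.use_hierarchy')
    try:
        with open(path) as f:
            return f.read().strip() == '1'
    except IOError:
        # File missing or unreadable: treat hierarchy accounting as unavailable.
        return False
```

For example, `uses_hierarchy('/sys/fs/cgroup/memory')` would check the root memory cgroup, assuming that common mount point.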


# Notes on Performance:

I used `top -p <thermos-pid> -bc -n 10 | grep 'python'` to monitor the CPU usage of the
Thermos Observer in my Vagrant environment, which was running 7 Tasks, each with 3 Processes.
> Stock Thermos Observer
```
21641 root      20   0 1351200  44448   4088 S   6.6  1.4   0:35.69 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root      20   0 1351200  44448   4088 S   2.7  1.4   0:35.77 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root      20   0 1351200  44448   4088 S   3.3  1.4   0:35.87 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root      20   0 1351200  44448   4088 S   2.3  1.4   0:35.94 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root      20   0 1351200  44448   4088 S   4.3  1.4   0:36.07 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root      20   0 1351200  44448   4088 S   3.6  1.4   0:36.18 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root      20   0 1351204  44616   4088 S  11.6  1.4   0:36.53 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root      20   0 1351200  44552   4088 S  39.6  1.4   0:37.72 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root      20   0 1351200  44552   4088 S   2.7  1.4   0:37.80 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root      20   0 1351200  44552   4088 S   7.6  1.4   0:38.03 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
```
> Thermos Observer using cgroup monitoring
```
15203 root      20   0 1367828  45344   4088 S   6.6  1.5   0:55.37 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root      20   0 1367828  45344   4088 S   2.0  1.5   0:55.43 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root      20   0 1351436  45308   4088 S   4.3  1.5   0:55.56 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root      20   0 1351436  45308   4088 S   2.3  1.5   0:55.63 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root      20   0 1351436  45308   4088 S   2.0  1.5   0:55.69 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root      20   0 1351436  45308   4088 S   3.3  1.5   0:55.79 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root      20   0 1351436  45308   4088 S   2.3  1.5   0:55.86 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root      20   0 1351436  45308   4088 S   1.0  1.5   0:55.89 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root      20   0 1351436  45308   4088 S   2.3  1.5   0:55.96 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root      20   0 1351436  45308   4088 S   3.3  1.5   0:56.06 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex
--ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
```
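
To make the two listings easier to compare, the `%CPU` column of each can be summarized with a few lines of Python. The sample values below are copied from the output above; the `summarize` helper is only an illustration.

```python
# %CPU samples copied from the two `top` listings above.
stock = [6.6, 2.7, 3.3, 2.3, 4.3, 3.6, 11.6, 39.6, 2.7, 7.6]
cgroup = [6.6, 2.0, 4.3, 2.3, 2.0, 3.3, 2.3, 1.0, 2.3, 3.3]


def summarize(samples):
    """Return the mean, median and maximum of a list of %CPU samples."""
    ordered = sorted(samples)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        median = (ordered[mid - 1] + ordered[mid]) / 2.0
    else:
        median = ordered[mid]
    return {
        'mean': sum(ordered) / float(len(ordered)),
        'median': median,
        'max': ordered[-1],
    }


for name, samples in (('stock', stock), ('cgroup', cgroup)):
    print(name, summarize(samples))
```

This gives a mean of about 8.4% (median about 4.0%, max 39.6%) for the stock observer versus about 2.9% (median 2.3%, max 6.6%) with cgroup monitoring. With only ten samples per run, one large spike in the stock run, and different `--log_to_disk` settings between the two runs, these numbers are indicative rather than conclusive.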


Diffs (updated)
-----

  examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f

  src/main/python/apache/aurora/executor/thermos_task_runner.py 8f88af4c24ddc603fa12587741af56a6c711e420

  src/main/python/apache/thermos/core/cgroup.py PRE-CREATION 
  src/main/python/apache/thermos/core/process.py 4a4678ff39c84cb87836aca19365c5b2aabc4fa4

  src/main/python/apache/thermos/monitoring/process_collector_cgroup.py PRE-CREATION 
  src/main/python/apache/thermos/monitoring/resource.py 434666696e600a0e6c19edd986c86575539976f2

  src/main/python/apache/thermos/observer/http/templates/task.tpl f3e06985eb3c05572aa4389d97da575b1179f616



Diff: https://reviews.apache.org/r/60748/diff/3/

Changes: https://reviews.apache.org/r/60748/diff/2-3/


Testing
-------

This patch is mostly a prototype. Note that I had to enable Mesos's CPU and memory isolation to test it.

Current tests pass. I first want to see how the community feels generally about this approach,
and then I will add additional tests.


Thanks,

Reza Motamedi

