Subject: Re: Review Request 60748: Prototype using cgroups for monitoring Thermos Process resource consumption (CPU and memory)
From: Reza Motamedi
To: Santhosh Kumar Shanmugham, David McLaughlin, Stephan Erb, Zameer Manji
Cc: Aurora ReviewBot, Reza Motamedi, Aurora
Date: Tue, 11 Jul 2017 20:12:09 -0000
Message-ID: <20170711201209.38508.20037@reviews-vm2.apache.org>
In-Reply-To: <20170711064742.38508.27112@reviews-vm2.apache.org>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="===============7226445086917056755=="

--===============7226445086917056755==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit

-----------------------------------------------------------
This is an automatically generated e-mail.
To reply, visit: https://reviews.apache.org/r/60748/
-----------------------------------------------------------

(Updated July 11, 2017, 8:12 p.m.)


Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.


Repository: aurora


Description (updated)
-------

# Prototype using cgroups for monitoring Thermos Process resource consumption (CPU and memory)

The idea behind this prototype is to use kernel cgroups instead of per-PID monitoring of Thermos Tasks and Processes. This [document](https://docs.google.com/document/d/1i5GY8cK_KZ_ebG8V2FLXeu0waRqzSHz82bAHCud_yoQ/edit?usp=sharing) describes the problem that this prototype tries to solve in more detail.

__Note:__ Since I am piggybacking on the cgroup clean-up implemented in Mesos, if Mesos's memory and CPU isolation are not enabled, I will not create cgroups and will simply revert to the old monitoring scheme.

__Important Compatibility:__ It also came to my attention that this kind of memory monitoring only works when the `memory.use_hierarchy` flag is enabled. At least in my Vagrant environment this does not seem to be the case, so some support on the Mesos side is needed first.

# Notes on Performance:

I used `top -p -bc -n 10 | grep 'python'` to monitor the CPU usage of Thermos in my Vagrant environment. I had 7 Tasks, each with 3 Processes.
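
For readers unfamiliar with the cgroup-based approach described above, the following is a minimal sketch of what sampling a cgroup can look like. It is not the code in this patch. It assumes the cgroup v1 control files `cpuacct.usage`, `memory.usage_in_bytes`, and `memory.use_hierarchy`; the function names and the directory arguments are hypothetical and chosen only for illustration.

```python
import os
import tempfile


def read_int(path):
    """Read a single integer value from a cgroup control file."""
    with open(path) as f:
        return int(f.read().strip())


def hierarchy_enabled(memory_cgroup_dir):
    """Return True if memory.use_hierarchy is set to 1 for this cgroup."""
    flag = os.path.join(memory_cgroup_dir, "memory.use_hierarchy")
    return read_int(flag) == 1


def sample_cgroup(cpuacct_cgroup_dir, memory_cgroup_dir):
    """Return (cumulative_cpu_ns, memory_bytes) for one cgroup.

    The two directories are hypothetical arguments: on a real host they
    would point into the cpuacct and memory cgroup hierarchies.
    """
    cpu_ns = read_int(os.path.join(cpuacct_cgroup_dir, "cpuacct.usage"))
    mem_bytes = read_int(
        os.path.join(memory_cgroup_dir, "memory.usage_in_bytes"))
    return cpu_ns, mem_bytes


def cpu_fraction(prev_ns, curr_ns, interval_secs):
    """Turn two cumulative CPU readings into cores used over the interval."""
    return (curr_ns - prev_ns) / (interval_secs * 1e9)


if __name__ == "__main__":
    # Fake cgroup directory so the sketch runs without a real cgroup mount.
    fake = tempfile.mkdtemp()
    for name, value in [("cpuacct.usage", "2500000000"),
                        ("memory.usage_in_bytes", "1048576"),
                        ("memory.use_hierarchy", "1")]:
        with open(os.path.join(fake, name), "w") as f:
            f.write(value + "\n")
    print(hierarchy_enabled(fake), sample_cgroup(fake, fake))
```

The point of the sketch is that one read per control file covers every process in the cgroup, instead of one read per PID. Where the cgroup directories actually live depends on the host and on how Mesos lays out its cgroups, so the paths are left as arguments here.
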

> Stock Thermos Observer

```
21641 root 20 0 1351200 44448 4088 S 6.6 1.4 0:35.69 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root 20 0 1351200 44448 4088 S 2.7 1.4 0:35.77 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root 20 0 1351200 44448 4088 S 3.3 1.4 0:35.87 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root 20 0 1351200 44448 4088 S 2.3 1.4 0:35.94 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root 20 0 1351200 44448 4088 S 4.3 1.4 0:36.07 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root 20 0 1351200 44448 4088 S 3.6 1.4 0:36.18 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root 20 0 1351204 44616 4088 S 11.6 1.4 0:36.53 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root 20 0 1351200 44552 4088 S 39.6 1.4 0:37.72 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root 20 0 1351200 44552 4088 S 2.7 1.4 0:37.80 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
21641 root 20 0 1351200 44552 4088 S 7.6 1.4 0:38.03 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=NONE --log_to_stderr=google:INFO
```

> Thermos Observer using CGROUP monitoring

```
15203 root 20 0 1367828 45344 4088 S 6.6 1.5 0:55.37 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root 20 0 1367828 45344 4088 S 2.0 1.5 0:55.43 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root 20 0 1351436 45308 4088 S 4.3 1.5 0:55.56 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root 20 0 1351436 45308 4088 S 2.3 1.5 0:55.63 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root 20 0 1351436 45308 4088 S 2.0 1.5 0:55.69 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root 20 0 1351436 45308 4088 S 3.3 1.5 0:55.79 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root 20 0 1351436 45308 4088 S 2.3 1.5 0:55.86 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root 20 0 1351436 45308 4088 S 1.0 1.5 0:55.89 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root 20 0 1351436 45308 4088 S 2.3 1.5 0:55.96 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
15203 root 20 0 1351436 45308 4088 S 3.3 1.5 0:56.06 python2.7 /home/vagrant/aurora/dist/thermos_observer.pex --ip=192.168.33.7 --port=1338 --log_to_disk=DEBUG --log_to_stderr=google:INFO
```


Diffs
-----

  examples/vagrant/mesos_config/etc_mesos-slave/isolation 1a7028ffc70116b104ef3ad22b7388f637707a0f
  src/main/python/apache/aurora/executor/thermos_task_runner.py 8f88af4c24ddc603fa12587741af56a6c711e420
  src/main/python/apache/thermos/core/cgroup.py PRE-CREATION
  src/main/python/apache/thermos/core/process.py 4a4678ff39c84cb87836aca19365c5b2aabc4fa4
  src/main/python/apache/thermos/monitoring/process_collector_cgroup.py PRE-CREATION
  src/main/python/apache/thermos/monitoring/resource.py 434666696e600a0e6c19edd986c86575539976f2
  src/main/python/apache/thermos/observer/http/templates/task.tpl f3e06985eb3c05572aa4389d97da575b1179f616


Diff: https://reviews.apache.org/r/60748/diff/3/


Testing
-------

This patch is mostly a prototype. Note that I had to enable Mesos's CPU and memory isolation. Current tests pass. I first want to see how the community feels generally about this approach, and then I will add additional tests.


Thanks,

Reza Motamedi

--===============7226445086917056755==--