Mailing-List: contact reviews-help@aurora.apache.org; run by ezmlm
Content-Type: multipart/alternative; boundary="===============3256417308305513152=="
MIME-Version: 1.0
Subject: Re: Review Request 60376: Observer task page to load consumption info from history
From: Reza Motamedi
To: Santhosh Kumar Shanmugham, David McLaughlin, Joshua Cohen
Cc: Aurora, Stephan Erb, Reza Motamedi, Aurora ReviewBot
Date: Mon, 26 Jun 2017 18:26:04 -0000
Message-ID: <20170626182604.15286.24699@reviews-vm2.apache.org>
X-ReviewBoard-URL: https://reviews.apache.org/
Auto-Submitted: auto-generated
Sender: Reza Motamedi
X-ReviewGroup: Aurora
X-Auto-Response-Suppress: DR, RN, OOF, AutoReply
X-ReviewRequest-URL: https://reviews.apache.org/r/60376/
X-Sender: Reza Motamedi
References: <20170626173808.15286.89523@reviews-vm2.apache.org>
In-Reply-To: <20170626173808.15286.89523@reviews-vm2.apache.org>
Reply-To: Reza Motamedi
X-ReviewRequest-Repository: aurora
archived-at: Mon, 26 Jun 2017 18:26:13 -0000

--===============3256417308305513152==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit


> On June 26, 2017, 5:38 p.m., Stephan Erb wrote:
> > What is your main optimization objective? Reducing page load time or reducing steady observer CPU load?
> >
> > I have observed that when running many tasks per node (say ~30-100), it can happen that the metric collection threads essentially starve the UI from almost all CPU time (due to the Python GIL).
> > In these cases, it would actually be better to just use fresh metrics all the time and eliminate the regular collection instead. This would result in slower UI rendering but should yield more consistent latency.
>
> Reza Motamedi wrote:
> I observed the same problem as well. My objective was to reduce page load time, and what worked best was to reuse the collected resource consumption data. This lets us keep all the information that we currently provide.
>
> I did a more or less thorough profiling of what consumes the most CPU and takes the longest, and saw that looking up the children of a PID seems to be very CPU intensive. Check the psutil implementation here: [Process.children](https://pythonhosted.org/psutil/_modules/psutil.html#Process.children). Constantly running this in the background does not seem to help :).
>
> I agree that the background thread that computes the resource consumption of all processes isn't very useful, and it might be better to collect all consumption data as users visit pages. However, we need to remember that the thread also performs some collections that could easily become slow to compute, for instance running `du` on `n` sandboxes. Also, users can easily flood the UI by constantly refreshing the page and triggering repeated work.
>
> An alternative solution would be to keep the disk collection inside an always-running thread and collect CPU and memory as users visit the page. This should only change what we do when showing the Thermos host (landing) page. That said, I am not sure how that would perform in practice when `du` is backlogged.


- Reza


-----------------------------------------------------------
This is an automatically generated e-mail.
To reply, visit: https://reviews.apache.org/r/60376/#review178907
-----------------------------------------------------------


On June 22, 2017, 8:18 p.m., Reza Motamedi wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60376/
> -----------------------------------------------------------
>
> (Updated June 22, 2017, 8:18 p.m.)
>
>
> Review request for Aurora, David McLaughlin, Joshua Cohen, and Santhosh Kumar Shanmugham.
>
>
> Repository: aurora
>
>
> Description
> -------
>
> # Observer task page to load consumption info from history
>
> The resource consumption of Thermos Processes is periodically calculated by TaskResourceMonitor threads (one thread per Thermos task). This information is used to display a (semi) fresh state of the tasks running on a host in the Observer host page, aka the landing page. An aggregate history of the consumption is kept at the task level, although TaskResourceMonitor first needs to collect the resources at the Process level and then aggregate them.
>
> On the other hand, when an Observer _task page_ is visited, the resource consumption of the Thermos Processes within that task is calculated again and displayed without being aggregated. This can become very slow, since the time to complete the resource calculation is affected by the load on the host.
>
> With this patch we take advantage of the periodic work and serve the resource information requested by the Observer task page from the already collected resource consumption data.
>
>
> Diffs
> -----
>
> src/main/python/apache/thermos/monitoring/resource.py 434666696e600a0e6c19edd986c86575539976f2
> src/test/python/apache/thermos/monitoring/test_resource.py d794a998f1d9fc52ba260cd31ac444aee7f8ed28
>
>
> Diff: https://reviews.apache.org/r/60376/diff/1/
>
>
> Testing
> -------
>
> I stress tested this patch on a host that had a slow Observer page.
> Interestingly, I did not need to do much to make the Observer slow. There are a few points to be made clear first.
> - We at Twitter limit the resources allocated to the Observer using `systemd`. The Observer is allowed to use only 20% of a CPU core. The attached screenshots are from such a setup.
> - With 20% of a CPU core assigned to the Observer, starting only 8 `task`s, each with 3 `process`es, is enough to make the Observer slow: 11 seconds to load the `task page`.
>
>
> File Attachments
> ----------------
>
> without the patch -- Screen Shot 2017-06-22 at 1.11.12 PM.png
> https://reviews.apache.org/media/uploaded/files/2017/06/22/03968028-a2f5-4a99-ba57-b7a41c471436__without_the_patch_--_Screen_Shot_2017-06-22_at_1.11.12_PM.png
> with the patch -- Screen Shot 2017-06-22 at 1.07.41 PM.png
> https://reviews.apache.org/media/uploaded/files/2017/06/22/5962c018-27d3-4463-a277-f6ad48b7f2d7__with_the_patch_--_Screen_Shot_2017-06-22_at_1.07.41_PM.png
>
>
> Thanks,
>
> Reza Motamedi
>
>

--===============3256417308305513152==--
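
The alternative raised in the reply (keep disk collection in a background thread, sample CPU and memory when a page is visited) leaves open how to stop repeated page refreshes from triggering repeated work. One possible guard is a short time-to-live cache in front of the on-demand sampling. The following is only a minimal sketch of that idea: `OnDemandSampler`, `sample_fn`, `ttl`, and `clock` are hypothetical names for illustration and are not part of the Thermos code base or of the patch under review.

```python
import threading
import time


class OnDemandSampler(object):
    """Caches the result of an expensive sampling callable for `ttl` seconds.

    Hypothetical helper: `sample_fn` stands in for whatever computes the
    per-process CPU/memory numbers when a page is requested.
    """

    def __init__(self, sample_fn, ttl=5.0, clock=time.time):
        self._sample_fn = sample_fn
        self._ttl = ttl
        self._clock = clock  # injectable so the cache can be tested without sleeping
        self._lock = threading.Lock()
        self._value = None
        self._stamp = None  # time of the last real sample, None until first call

    def get(self):
        # The lock is held while sampling, so concurrent page loads wait for
        # and share one sample instead of each starting their own.
        with self._lock:
            now = self._clock()
            if self._stamp is None or now - self._stamp >= self._ttl:
                self._value = self._sample_fn()
                self._stamp = now
            return self._value


# Hypothetical usage: the lambda is a placeholder for the real collection.
sampler = OnDemandSampler(lambda: {"cpu": 0.0, "ram": 0}, ttl=5.0)
first = sampler.get()   # triggers a real sample
second = sampler.get()  # within the TTL, served from the cache
```

With a guard like this, refreshing the page faster than `ttl` costs at most one sample per `ttl` window, at the price of showing data that may be up to `ttl` seconds old.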