Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id F1548200CCA for ; Wed, 19 Jul 2017 22:55:05 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id EF911169E12; Wed, 19 Jul 2017 20:55:05 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 1B3F2169E19 for ; Wed, 19 Jul 2017 22:55:04 +0200 (CEST) Received: (qmail 36324 invoked by uid 500); 19 Jul 2017 20:55:04 -0000 Mailing-List: contact issues-help@aurora.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@aurora.apache.org Delivered-To: mailing list issues@aurora.apache.org Received: (qmail 36313 invoked by uid 99); 19 Jul 2017 20:55:04 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 19 Jul 2017 20:55:04 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id BCBDF1A1A1F for ; Wed, 19 Jul 2017 20:55:03 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -99.202 X-Spam-Level: X-Spam-Status: No, score=-99.202 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id QlD77pZ8VVlt for ; Wed, 19 Jul 2017 20:55:02 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTP id B89A660D70 for ; Wed, 19 Jul 2017 20:55:01 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id C6A1CE0D54 for ; Wed, 19 Jul 2017 20:55:00 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 3D4AF21EB1 for ; Wed, 19 Jul 2017 20:55:00 +0000 (UTC) Date: Wed, 19 Jul 2017 20:55:00 +0000 (UTC) From: "Reza Motamedi (JIRA)" To: issues@aurora.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Wed, 19 Jul 2017 20:55:06 -0000 [ https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093761#comment-16093761 ] Reza Motamedi commented on AURORA-1939: --------------------------------------- I see this problem when the host is super busy, and resource collection is backlogged. In this case I also see much more errors of this kind in the log: {noformat} D0719 20:18:28.064794 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62193) D0719 20:18:35.351458 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=14711) D0719 20:18:35.552953 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62331) D0719 20:18:42.857400 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.01s D0719 20:18:43.753732 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62338) D0719 20:18:48.454077 24474 mesos_vars.py:384] Metrics collection took 6506.1ms D0719 20:18:50.253031 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62345) D0719 20:18:57.861535 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.00s D0719 20:19:12.955235 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.01s D0719 20:19:14.959180 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62361) D0719 20:19:14.960768 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62232) D0719 20:19:18.056128 24474 mesos_vars.py:384] Metrics collection took 6008.0ms D0719 20:19:22.856868 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62366) D0719 20:19:28.048165 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.09s D0719 20:19:28.660691 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62374) D0719 20:19:43.051047 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.00s D0719 20:19:48.355678 24474 mesos_vars.py:384] Metrics collection took 6299.8ms D0719 20:19:58.148663 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.10s D0719 20:20:11.449485 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62271) D0719 20:20:13.155102 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.01s D0719 20:20:18.249528 24474 mesos_vars.py:384] Metrics collection took 6179.9ms D0719 20:20:23.354832 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=11317) D0719 20:20:27.060431 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62281) D0719 20:20:28.160298 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.00s D0719 20:20:35.452637 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62289) D0719 20:20:43.252589 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.09s D0719 20:20:48.151144 24474 mesos_vars.py:384] Metrics collection took 6058.3ms D0719 20:20:55.254796 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62428) D0719 20:20:58.257311 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.00s D0719 20:21:13.352955 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.10s D0719 20:21:17.555244 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62124) {noformat} > Thermos landing (host) page reports incorrect CPU rates when it is busy > ----------------------------------------------------------------------- > > Key: AURORA-1939 > URL: https://issues.apache.org/jira/browse/AURORA-1939 > Project: Aurora > Issue Type: Bug > Reporter: Reza Motamedi > Priority: Minor > > Thermos Observer uses `psutil` to monitor resource consumption of Thermos Processes. On a busy machine, I have noticed negative CPU values when visiting the Thermos landing page. > In my test I reproduced this by starting many processes that constantly create short lived children. This indicates that in time between `process_collector_psutil` looks up the Process children and the time it calculates the CPU time the pid of the child is actually reused by another much younger process, which leads to negative CPU times. -- This message was sent by Atlassian JIRA (v6.4.14#64029)