aurora-issues mailing list archives

From "Reza Motamedi (JIRA)" <>
Subject [jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy
Date Tue, 27 Jun 2017 05:21:00 GMT


Reza Motamedi commented on AURORA-1939:

On second thought, the negative CPU values can simply be caused by a dead child process. Let
me explain how. First, remember that the CPU time reported by psutil is the cumulative CPU
time a process has consumed so far.

Suppose that at {{t_0 = 10}} the following processes have been forked inside a Thermos process.

__ p0
   \_ p1

The total CPU time of the Thermos process is calculated as the sum of the CPU time of all these
processes, i.e., {{Process(p_0).cpu_time + Process(p_1).cpu_time}}. For the sake of argument,
let's say 1 second in {{p_0}} and 5 seconds in {{p_1}}.
Now imagine that by the time the next sample is collected at {{t_1 = 20}}, another 5 seconds
were spent in {{p_1}}, and {{p_1}} finishes (dies) before the collection. Also, only an extra
1 second was spent by {{p_0}}. The current calculation leads to the following reported CPU value.

(sum(new_samples) - sum(old_samples)) / (time difference)
((2) - (1 + 5)) / (20 - 10) = -4/10
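
To make the arithmetic concrete, here is a minimal sketch of that naive calculation. It does not call psutil. The samples are plain dictionaries of cumulative CPU seconds, and the function name and the "p0"/"p1" keys are hypothetical, chosen only for illustration. It assumes the old sample is p_0 = 1 s and p_1 = 5 s at t_0 = 10, and the new sample is p_0 = 2 s at t_1 = 20 with p_1 already dead.

```python
def naive_cpu_rate(old_sample, new_sample, interval_secs):
    """CPU rate from the difference of the summed cumulative CPU times."""
    return (sum(new_sample.values()) - sum(old_sample.values())) / interval_secs


# Cumulative CPU seconds per process (keys are illustrative labels, not real pids).
old_sample = {"p0": 1.0, "p1": 5.0}  # collected at t_0 = 10
new_sample = {"p0": 2.0}             # collected at t_1 = 20; p1 died in between

naive_rate = naive_cpu_rate(old_sample, new_sample, interval_secs=20 - 10)
print(naive_rate)  # -0.4
```

The 5 seconds that the dead process contributed to the old sum have no counterpart in the new sum, so the difference goes negative.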

A perfect calculation would include the time spent in the dead processes, as of the time of
their death, in the new sample. What makes sense in practice is to discard from the old sample
the processes that have died during the last time interval.
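
One way to sketch the "discard the dead processes" idea, again using plain dictionaries rather than psutil: compute the delta per process and ignore any entry of the old sample that is missing from the new one. The function name and keys are hypothetical, and this is not the actual Thermos patch. It also assumes the keys identify a process uniquely, so in real code a key such as (pid, creation time) would probably be needed to guard against pid reuse.

```python
def cpu_rate_ignoring_dead(old_sample, new_sample, interval_secs):
    """CPU rate that drops processes which died since the old sample.

    Both samples map a process key to cumulative CPU seconds.
    Assumption: a key identifies one process for its whole lifetime.
    """
    delta = 0.0
    for key, cpu_secs in new_sample.items():
        # Processes that are new in this sample started during the interval,
        # so all of their CPU time counts (baseline of 0).
        delta += cpu_secs - old_sample.get(key, 0.0)
    # Keys present only in old_sample (dead processes) are never visited.
    return delta / interval_secs


old_sample = {"p0": 1.0, "p1": 5.0}  # collected at t_0 = 10
new_sample = {"p0": 2.0}             # collected at t_1 = 20; p1 died in between

fixed_rate = cpu_rate_ignoring_dead(old_sample, new_sample, interval_secs=20 - 10)
print(fixed_rate)  # 0.1
```

This undercounts, because the CPU time the dead process used during the interval is lost, but the reported rate can no longer become negative as long as the per-process counters are monotonic.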

> Thermos landing (host) page reports incorrect CPU rates when it is busy
> -----------------------------------------------------------------------
>                 Key: AURORA-1939
>                 URL:
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Reza Motamedi
>            Priority: Minor
> Thermos Observer uses `psutil` to monitor resource consumption of Thermos Processes.
> On a busy machine, I have noticed negative CPU values when visiting the Thermos landing page.
> In my test I reproduced this by starting many processes that constantly create short-lived
> children. This indicates that, between the time `process_collector_psutil` looks up the
> Process children and the time it calculates the CPU time, the pid of a child is actually
> reused by another, much younger process, which leads to negative CPU times.
