aurora-dev mailing list archives

From "Erb, Stephan" <Stephan....@blue-yonder.com>
Subject Re: non-prod SLA stats
Date Mon, 01 Jun 2015 10:26:47 GMT
Hi Maxim,

introducing some toggles for metric collection should definitely work and can be contributed
via a simple pull request. 

However, if you are only concerned about a potential performance hit, we might as well think
about tuning the existing metric calculation. I have skimmed the code, and there seem to be
several pieces of more or less low-hanging fruit:

* The uptime computation performs the task enumeration and sorting operation for every percentile,
whereas this only needs to be done once.
* The current approach used to compute a percentile takes O(n log n) time. There are alternative
solutions running in only O(n) time.
* There are some unnecessary allocations, e.g., SlaUtil.percentile() is always called on a
temporary list. However, the first thing it does is create a copy of that list.
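
To make the first two points concrete, here is a minimal sketch in plain Java. It is not the existing SlaUtil code: the class name PercentileSketch, its method names, and the nearest-rank percentile definition are assumptions chosen for illustration. The first method sorts once and reads several percentiles from the same sorted copy; the second uses quickselect, which runs in expected (not worst-case) O(n) time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration only; not part of Aurora.
public final class PercentileSketch {
  private PercentileSketch() {}

  // Assumed definition: nearest-rank index of percentile p (0 < p <= 100)
  // in an ascending-sorted sequence of n elements.
  static int rankIndex(int n, double p) {
    int rank = (int) Math.ceil(p / 100.0 * n);
    return Math.max(0, Math.min(n - 1, rank - 1));
  }

  // Sorts a single copy once, then reads every requested percentile from it.
  public static long[] percentilesSortedOnce(List<Long> values, double... ps) {
    if (values.isEmpty()) {
      throw new IllegalArgumentException("values must not be empty");
    }
    List<Long> sorted = new ArrayList<>(values);
    Collections.sort(sorted);
    long[] out = new long[ps.length];
    for (int i = 0; i < ps.length; i++) {
      out[i] = sorted.get(rankIndex(sorted.size(), ps[i]));
    }
    return out;
  }

  // Single percentile without a full sort, using quickselect on a private copy.
  public static long percentileSelect(List<Long> values, double p) {
    if (values.isEmpty()) {
      throw new IllegalArgumentException("values must not be empty");
    }
    long[] a = new long[values.size()];
    for (int i = 0; i < a.length; i++) {
      a[i] = values.get(i);
    }
    return select(a, rankIndex(a.length, p));
  }

  // Returns the k-th smallest element (0-based). Reorders the array in place.
  public static long select(long[] a, int k) {
    if (k < 0 || k >= a.length) {
      throw new IllegalArgumentException("k out of range");
    }
    int lo = 0;
    int hi = a.length - 1;
    while (lo < hi) {
      int pivotIdx = lo + ThreadLocalRandom.current().nextInt(hi - lo + 1);
      int p = partition(a, lo, hi, pivotIdx);
      if (k == p) {
        return a[k];
      } else if (k < p) {
        hi = p - 1;
      } else {
        lo = p + 1;
      }
    }
    return a[lo];
  }

  // Lomuto partition: moves the pivot to its final sorted position and returns it.
  private static int partition(long[] a, int lo, int hi, int pivotIdx) {
    long pivot = a[pivotIdx];
    swap(a, pivotIdx, hi);
    int store = lo;
    for (int i = lo; i < hi; i++) {
      if (a[i] < pivot) {
        swap(a, i, store);
        store++;
      }
    }
    swap(a, store, hi);
    return store;
  }

  private static void swap(long[] a, int i, int j) {
    long tmp = a[i];
    a[i] = a[j];
    a[j] = tmp;
  }
}
```

When several percentiles are needed from the same data, the sort-once variant is probably the simpler win; quickselect pays off mainly when only one or two percentiles are read from a large list.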

How about this: I will file a ticket for non-prod SLA stats and contribute a simple patch
with toggles. If it turns out that these are unusable at Twitter scale, we can look into
basic performance tuning.

Best Regards,
Stephan
________________________________________
From: Maxim Khutornenko <maxim@apache.org>
Sent: Friday, May 29, 2015 7:23 PM
To: dev@aurora.apache.org
Subject: Re: non-prod SLA stats

Hi Stephan,

Tracking the same set of metrics for all non-prod jobs could be
somewhat expensive on both collection and consumption sides. The only
metrics we currently chose to collect are MTTA/R to help us monitor
scheduling rate in view of reduced cluster capacity (AURORA-774).
Perhaps we could put non-prod collection behind a set of command line
switches (Arg<Boolean>)? E.g.:

SLA_COLLECT_NON_PROD_MEDIANS
SLA_COLLECT_NON_PROD_JOB_UPTIMES
SLA_COLLECT_NON_PROD_PLATFORM_UPTIMES

These could be defined in SlaModule and injected into MetricCalculator
to let us finely tune the required non-prod collection set. What do
you think?
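
A rough sketch of how such toggles might gate collection, in plain Java. Everything here is hypothetical: the class NonProdSlaToggles, the MetricGroup enum, and shouldCollect() are illustrative names, the real switches would be Arg<Boolean> values rather than constructor booleans, and the rule that production metrics are always collected is an assumption.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical holder for the three proposed switches; not Aurora code.
public final class NonProdSlaToggles {

  // One group per proposed switch.
  public enum MetricGroup { MEDIANS, JOB_UPTIMES, PLATFORM_UPTIMES }

  private final Set<MetricGroup> enabledForNonProd;

  // The booleans stand in for the values of SLA_COLLECT_NON_PROD_MEDIANS,
  // SLA_COLLECT_NON_PROD_JOB_UPTIMES and SLA_COLLECT_NON_PROD_PLATFORM_UPTIMES.
  public NonProdSlaToggles(
      boolean collectMedians,
      boolean collectJobUptimes,
      boolean collectPlatformUptimes) {
    EnumSet<MetricGroup> set = EnumSet.noneOf(MetricGroup.class);
    if (collectMedians) {
      set.add(MetricGroup.MEDIANS);
    }
    if (collectJobUptimes) {
      set.add(MetricGroup.JOB_UPTIMES);
    }
    if (collectPlatformUptimes) {
      set.add(MetricGroup.PLATFORM_UPTIMES);
    }
    this.enabledForNonProd = Collections.unmodifiableSet(set);
  }

  // Assumption: production metrics are always collected; the toggles
  // only decide what is collected for non-production jobs.
  public boolean shouldCollect(boolean isProduction, MetricGroup group) {
    return isProduction || enabledForNonProd.contains(group);
  }
}
```

A calculator receiving this object could call shouldCollect() once per metric group before doing any work for non-prod jobs, so disabled groups cost nothing on the collection side.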

Thanks,
Maxim

On Fri, May 29, 2015 at 7:09 AM, Erb, Stephan
<Stephan.Erb@blue-yonder.com> wrote:
> Hi everyone,
>
> we are interested in the job uptime percentiles and the aggregate cluster uptime
> percentage not only for production jobs, but also for our non-production jobs.
>
> Are there any reasons why those are not available in a non-prod version, similar to the
> current handling of mtta and mttr [1]? If there are no objections, I will prepare a patch.
>
> Regards,
> Stephan
>
> [1] https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java#L69