aurora-dev mailing list archives

From "Erb, Stephan" <>
Subject Re: non-prod SLA stats
Date Tue, 16 Jun 2015 09:13:55 GMT
Hi Maxim,

I have submitted a first patch closely following your initial proposal.  The patch needs another
iteration or two, so please let me know what you think. 


From: Maxim Khutornenko <>
Sent: Monday, June 1, 2015 6:30 PM
Subject: Re: non-prod SLA stats

Hi Stephan,

Thanks for your analysis. I must mention, though, that SLA algorithms
were optimized for readability rather than CPU performance.

Given the current minutely run cycle, I would not be concerned about
calculation delay unless it threatens to break the schedule. In a
large cluster with tens of thousands of SLA metrics (both prod and
non-prod) the average observed SLA run time is around 4 seconds, which
gives us plenty of headroom for growth.

I am more concerned about the heap space used to store computed
counters here. This may quickly become a bottleneck and as a side
effect make our /vars endpoint unusable. Hence, the suggestion to make
non-essential stats fully toggle-able.

That said, if you envision a different use case with a much larger
metric set or anticipate a more frequent run schedule - feel free to
propose patches. I'd also encourage you to invest some time into an SLA
benchmark using our JMH-based harness to back your changes with real
perf data.


On Mon, Jun 1, 2015 at 3:26 AM, Erb, Stephan
<> wrote:
> Hi Maxim,
> Introducing some toggles for metric collection should definitely work and can be contributed
> via a simple pull request.
> However, if you are only concerned about a potential performance hit, we might as well
> think about tuning the existing metric calculation. I have skimmed the code, and there seems
> to be some more or less low-hanging fruit:
> * The uptime computation performs the task enumeration and sorting operation for every
>   percentile, whereas this only needs to be done once.
> * The current approach used to compute a percentile takes O(n log n) time. There are
>   alternative solutions running in only O(n) time.
> * There are some unnecessary allocations, e.g., SlaUtil.percentile() is always called
>   on a temporary list. However, the first thing it does is to create a copy of that list.
> How about this: I will file a ticket for non-prod SLA stats and contribute a simple patch
> with toggles. If it turns out that these are unusable at Twitter scale, we can look into
> basic performance tuning.
> Best Regards,
> Stephan
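
The first two tuning ideas above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in SlaUtil: the class name PercentileSketch, its methods, and the nearest-rank percentile definition are all hypothetical, and the actual SLA code may define percentiles differently. percentilesSortedOnce() sorts a single copy and reads every percentile from it; select() is a quickselect that finds one rank in expected O(n) time without a full sort.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper, not part of Aurora.
final class PercentileSketch {
  private PercentileSketch() {}

  // 0-based nearest-rank index of the p-th percentile (0 < p <= 100) among n sorted values.
  static int rankIndex(int n, double p) {
    int rank = (int) Math.ceil(p * n / 100.0);
    return Math.max(rank, 1) - 1;
  }

  // Sorts one copy of the input, then reads every requested percentile from it.
  static long[] percentilesSortedOnce(long[] values, double... ps) {
    long[] sorted = values.clone();
    Arrays.sort(sorted);
    long[] out = new long[ps.length];
    for (int i = 0; i < ps.length; i++) {
      out[i] = sorted[rankIndex(sorted.length, ps[i])];
    }
    return out;
  }

  // Quickselect: returns the k-th smallest element (0-based) in expected O(n) time.
  // Reorders the array in place, so pass a copy if the caller still needs the original order.
  static long select(long[] a, int k, Random rnd) {
    int lo = 0;
    int hi = a.length - 1;
    while (lo < hi) {
      long pivot = a[lo + rnd.nextInt(hi - lo + 1)];
      int i = lo;
      int j = hi;
      // Hoare-style partition around the pivot value.
      while (i <= j) {
        while (a[i] < pivot) i++;
        while (a[j] > pivot) j--;
        if (i <= j) {
          long tmp = a[i];
          a[i] = a[j];
          a[j] = tmp;
          i++;
          j--;
        }
      }
      if (k <= j) {
        hi = j;        // target rank lies in the left part
      } else if (k >= i) {
        lo = i;        // target rank lies in the right part
      } else {
        return a[k];   // target sits between the parts and equals the pivot
      }
    }
    return a[k];
  }
}
```

When several percentiles are needed per job, the sort-once variant is probably the simpler win; quickselect pays off mainly when only one or two ranks are requested from a large list.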
> ________________________________________
> From: Maxim Khutornenko <>
> Sent: Friday, May 29, 2015 7:23 PM
> To:
> Subject: Re: non-prod SLA stats
> Hi Stephan,
> Tracking the same set of metrics for all non-prod jobs could be
> somewhat expensive on both collection and consumption sides. The only
> metrics we currently chose to collect are MTTA/R to help us monitor
> scheduling rate in view of reduced cluster capacity (AURORA-774).
> Perhaps we could put non-prod collection behind a set of command line
> switches (Arg<Boolean>)? E.g.:
> These could be defined in SlaModule and injected into MetricCalculator
> to let us finely tune the required non-prod collection set. What do
> you think?
> Thanks,
> Maxim
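
The toggle idea above could look roughly like the following. This is a minimal sketch in plain Java that deliberately avoids the real Arg<Boolean> machinery; the class NonProdSlaToggleSketch, the Metric enum, and the two boolean parameters are hypothetical names chosen for illustration. It assumes MTTA/MTTR stay always on for non-prod jobs, as described in this thread, while the uptime stats are opt-in.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration, not Aurora's SlaModule or MetricCalculator.
final class NonProdSlaToggleSketch {
  // Hypothetical metric categories, named after the stats discussed in this thread.
  enum Metric { MTTA, MTTR, JOB_UPTIME, CLUSTER_UPTIME }

  private NonProdSlaToggleSketch() {}

  // Builds the set of non-prod metrics to compute from two boolean switches.
  // In a real module the booleans would come from command line arguments.
  static Set<Metric> enabledNonProdMetrics(boolean jobUptime, boolean clusterUptime) {
    Set<Metric> enabled = EnumSet.of(Metric.MTTA, Metric.MTTR);
    if (jobUptime) {
      enabled.add(Metric.JOB_UPTIME);
    }
    if (clusterUptime) {
      enabled.add(Metric.CLUSTER_UPTIME);
    }
    return enabled;
  }
}
```

A calculator would then skip any non-prod metric not in the returned set, which keeps both the computation and the exported counters limited to what was asked for.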
> On Fri, May 29, 2015 at 7:09 AM, Erb, Stephan
> <> wrote:
>> Hi everyone,
>> We are interested in the job uptime percentiles and the aggregate cluster uptime
>> percentage not only for production jobs, but also for our non-production jobs.
>> Are there any reasons why those are not available in a non-prod version, similar
>> to the current handling of MTTA and MTTR [1]?  If there are no objections, I will prepare
>> a patch.
>> Regards,
>> Stephan
>> [1]