aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Maxim Khutornenko" <ma...@apache.org>
Subject Re: Review Request 20398: Scheduler SLA metrics.
Date Fri, 18 Apr 2014 18:46:44 GMT


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/async/RescheduleCalculator.java, line
117
> > <https://reviews.apache.org/r/20398/diff/2/?file=561774#file561774line117>
> >
> >     Pull this part up to the previous line:
> >     
> >       ITaskEvent activeEvent = Iterables.find(
> >           arg1,
> >           arg2);

Done.


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java, line 141
> > <https://reviews.apache.org/r/20398/diff/2/?file=561776#file561776line141>
> >
> >     This is a lazy iterable - you probably want to call toList() at the end to avoid
on-the-fly filtering each time you walk over it.

Done.


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java, line 114
> > <https://reviews.apache.org/r/20398/diff/2/?file=561777#file561777line114>
> >
> >     Log message would be more useful with the task ID.

I have realized that errors were not reported from the executor thread. Added afterExecute()
into the executor and converted this to explicit exception to avoid unneeded string concatenation.


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java, line 155
> > <https://reviews.apache.org/r/20398/diff/2/?file=561777#file561777line155>
> >
> >     Have you considered making MedianAlgorithm non-abstract, and have it accept
a ScheduleStatus in the constructor?  That would eliminate the need for these two classes.

These were mostly placeholders for the documentation. I guess we don't lose much from collapsing
them into one. Done.


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java, line 199
> > <https://reviews.apache.org/r/20398/diff/2/?file=561777#file561777line199>
> >
> >     Can you elaborate or clarify "relative to request time"?

Done.


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java, line 200
> > <https://reviews.apache.org/r/20398/diff/2/?file=561777#file561777line200>
> >
> >     The decay comment threw me a bit - this might be improved by using consistent
terminology "time frame".

Dropped it completely as it only adds confusion here.


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java, line 226
> > <https://reviews.apache.org/r/20398/diff/2/?file=561777#file561777line226>
> >
> >     Can you add a comment explaining the rationale here?  I would have expected
that this would only look at tasks currently in RUNNING.

Added a comment. We consider states like KILLING or RESTARTING as alive for job uptime.


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java, line 233
> > <https://reviews.apache.org/r/20398/diff/2/?file=561777#file561777line233>
> >
> >     s/numElements/percentileIndex/?

percentileElements would be more accurate here as it is not the index yet. The next line yields
index.


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java, line 235
> > <https://reviews.apache.org/r/20398/diff/2/?file=561777#file561777line235>
> >
> >     It would be good to extract a function to calculate the Nth percentile of a
List of numbers.  As it stands, this is duplicated and also inconsistent with the median calculation
(the other one averages when there's a non-even number of elements).
> >     
> >     You might even go so far to extract a PercentileAlgorithm, whose constructor
accepts something like Predicate<IScheduledTask> and Function<IScheduledTask, Long>.
 I'll leave that decision up to you, it might not make sense.

I see your point but don't think we should mix the two. The percentile calculation needs to
be discrete to accurately reflect the uptime at the specified instance percentile (floor(percentileElements)),
whereas the median algorithm tends to report the mid average (floor(average(n/2, n/2 - 1))).
I would prefer to keep uptime calculation more conservative with respect to the reported uptime
value.    


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java, line 253
> > <https://reviews.apache.org/r/20398/diff/2/?file=561777#file561777line253>
> >
> >     The name and doc don't line up with the values returned.  You're returning a
percentage [0, 100] (higher is better), but everything else suggests it's accumulated downtime
(implying higher is worse).

Agreed. Changed it to AggregatePlatformUptime to better align with job uptime and exposed
metric name.


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java, line 255
> > <https://reviews.apache.org/r/20398/diff/2/?file=561777#file561777line255>
> >
> >     please doc

Done.


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java, line 328
> > <https://reviews.apache.org/r/20398/diff/2/?file=561777#file561777line328>
> >
> >     Given that some functions deal with collections of events that pertain to different
tasks, it would be helpful to document places like this where the events must be for a single
task.

It's actually the opposite here. List<ITaskEvent> represents a cross task unified view
of the instance history. The TASKS_TO_TASK_EVENTS creates an instance timeline and it remains
like that to the end.


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java, line 333
> > <https://reviews.apache.org/r/20398/diff/2/?file=561777#file561777line333>
> >
> >     Use ImmutableList.Builder, to make sure that an immutable list is returned.

Done.


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java, line 339
> > <https://reviews.apache.org/r/20398/diff/2/?file=561777#file561777line339>
> >
> >     Please add comments in here, i suspect a newcomer would be pretty unclear on
what's going on vs what's intended.

Done.


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java, line 408
> > <https://reviews.apache.org/r/20398/diff/2/?file=561777#file561777line408>
> >
> >     Can you more thoroughly document the behavior here?

Sure, done.


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java, line 413
> > <https://reviews.apache.org/r/20398/diff/2/?file=561777#file561777line413>
> >
> >     In your comment, please be sure to explain why the events from multiple tasks
are merged and sorted.  Also, please rename TASKS_TO_TASK_EVENTS to indicate the sorting behavior.

Done.


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaGroup.java, line 103
> > <https://reviews.apache.org/r/20398/diff/2/?file=561778#file561778line103>
> >
> >     Please share instance size definitions with SlotSizeCounter to prevent drift.

Good idea. Converged on ResourceAggregates.


> On April 17, 2014, 11:41 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaGroup.java, line 138
> > <https://reviews.apache.org/r/20398/diff/2/?file=561778#file561778line138>
> >
> >     These three classes seem to have a pretty big overlap.  Can you solve this with
a single implementation that accepts:
> >     
> >       Map<String, Range<Double>>
> >       Function<IScheduledTask, Double>

I tried a similar approach before but did not like the upcasting to double everywhere (in
the map init and within the function), so I would rather prefer the current approach as less
clumsy. 


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20398/#review40705
-----------------------------------------------------------


On April 17, 2014, 11:39 p.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20398/
> -----------------------------------------------------------
> 
> (Updated April 17, 2014, 11:39 p.m.)
> 
> 
> Review request for Aurora, Kevin Sweeney, Bill Farner, and Brian Wickman.
> 
> 
> Bugs: AURORA-293
>     https://issues.apache.org/jira/browse/AURORA-293
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> High level overview:
> - MetricCalculator runs periodically (every minute), pulls all task history, packages
it into SlaInstance list and updates stats;
> - Stat calculation is handled by a pair of: SlaAlgorithm and a set of applicable SlaGroups
(logical groupings by job, cluster, resource and etc.);
> - Stat name is generated by combining group and algorithm name parts.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/app/AppModule.java eeafc784e915137cacd5f64df1252ccbaf6c0f6c

>   src/main/java/org/apache/aurora/scheduler/async/RescheduleCalculator.java bc77bf2e6fbc1ff4159d049b0c28afd6832499ef

>   src/main/java/org/apache/aurora/scheduler/base/Tasks.java fae2d237d18f945d4dd73ec56cd42981359bea46

>   src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/sla/SlaGroup.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/sla/SlaModule.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/sla/MetricCalculatorTest.java PRE-CREATION

>   src/test/java/org/apache/aurora/scheduler/sla/SlaAlgorithmTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/sla/SlaTestUtil.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/20398/diff/
> 
> 
> Testing
> -------
> 
> gradle build
> gradle run
> 
> Sample from local scheduler:
> 
> sla_cluster_mtta 3271.493
> sla_cluster_mttr 3273.497
> sla_cluster_platform_uptime_percent 100.0
> sla_cpu_small_mtta 3271.493
> sla_cpu_small_mttr 3273.497
> sla_disk_small_mtta 3271.493
> sla_disk_small_mttr 3273.497
> sla_mesos_test_serviceJob0_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob0_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob0_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob0_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob0_mtta 0.0
> sla_mesos_test_serviceJob0_mttr 0.0
> sla_mesos_test_serviceJob0_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob10_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob10_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob10_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob10_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob10_mtta 0.0
> sla_mesos_test_serviceJob10_mttr 0.0
> sla_mesos_test_serviceJob10_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob12_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob12_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob12_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob12_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob12_mtta 0.0
> sla_mesos_test_serviceJob12_mttr 0.0
> sla_mesos_test_serviceJob12_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob14_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob14_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob14_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob14_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob14_mtta 0.0
> sla_mesos_test_serviceJob14_mttr 0.0
> sla_mesos_test_serviceJob14_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob16_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob16_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob16_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob16_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob16_mtta 0.0
> sla_mesos_test_serviceJob16_mttr 0.0
> sla_mesos_test_serviceJob16_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob18_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob18_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob18_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob18_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob18_mtta 0.0
> sla_mesos_test_serviceJob18_mttr 0.0
> sla_mesos_test_serviceJob18_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob2_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob2_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob2_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob2_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob2_mtta 0.0
> sla_mesos_test_serviceJob2_mttr 0.0
> sla_mesos_test_serviceJob2_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob4_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob4_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob4_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob4_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob4_mtta 0.0
> sla_mesos_test_serviceJob4_mttr 0.0
> sla_mesos_test_serviceJob4_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob6_job_uptime_75_00_sec 7125
> sla_mesos_test_serviceJob6_job_uptime_90_00_sec 7125
> sla_mesos_test_serviceJob6_job_uptime_95_00_sec 7125
> sla_mesos_test_serviceJob6_job_uptime_99_00_sec 7125
> sla_mesos_test_serviceJob6_mtta 3271.493
> sla_mesos_test_serviceJob6_mttr 3273.497
> sla_mesos_test_serviceJob6_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob8_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob8_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob8_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob8_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob8_mtta 0.0
> sla_mesos_test_serviceJob8_mttr 0.0
> sla_mesos_test_serviceJob8_platform_uptime_percent 100.0
> sla_ram_small_mtta 3271.493
> sla_ram_small_mttr 3273.497
> 
> 
> File Attachments
> ----------------
> 
> Coverage report
>   https://reviews.apache.org/media/uploaded/files/2014/04/16/ffe00c63-bb3a-4b90-95f8-f23878b0fdab__SLA_coverage_report.png
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message