aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Zhitao Li" <zhitaoli...@gmail.com>
Subject Re: Review Request 42126: Accept resource offers from multiple framework roles.
Date Thu, 14 Jan 2016 16:30:26 GMT


> On Jan. 14, 2016, 1:06 a.m., Dmitriy Shirchenko wrote:
> > src/test/java/org/apache/aurora/scheduler/mesos/MesosTaskFactoryImplTest.java, lines
216-217
> > <https://reviews.apache.org/r/42126/diff/8/?file=1196146#file1196146line216>
> >
> >     Is this intentionally commented out?
> 
> Zhitao Li wrote:
>     Yes this is intentially removed. Old behavior of changing this `config` here resulted
in `resources` in static final object `OFFER` not sufficient to be allocated, and I don't
see any point to use a different executor for this test case, so I've deleted this code.
>     
>     If any reviewer thinks there is a reason to test this with a different executor,
I can reconstruct an `Offer` object with correct resource and add this back.
> 
> Bill Farner wrote:
>     This is slightly suspicious...are you sure resource accounting wasn't impacted in
some way?  I would expect old test cases to logically work with this feature.
> 
> Zhitao Li wrote:
>     I recalled why I commented this out in the first pass now (this was a bit tricky
and I mistakenly thought it's logical bug while it's due to floating point precision).
>     
>     In TaskExecutors.java, `SOME_OVERHEAD_EXECUTOR` uses `0.01` cpus:
>     ```
>       public static final ExecutorSettings SOME_OVERHEAD_EXECUTOR =
>           TestExecutorSettings.thermosOnlyWithOverhead(
>               new ResourceSlot(0.01, Amount.of(256L, Data.MB), Amount.of(0L, Data.MB),
0));
>     ```
>     Unfortunately, `5.01 - 5 - 0.01` **is not zero** in floating point math of Java,
so the test keeps throwing `InsufficientResourcesException` even if I constructed an `Offer`
whose resource is correctly equivalent to `ResourceSlot.from(TASK_CONFIG).add(TaskExecutors.SOME_OVERHEAD_EXECUTOR.getExecutorOverhead())`.
>     
>     **My proposal is to change all comparison with zero in `AcceptedOffer` class (3 places
in current patch) to compare with an `EPSILON` whose value is `1e-6` or something even smaller.**
I don't think there is any reasonable usage of resource unit smaller than this granularity,
and the floating point math in the class would be more robust. We can document this behavior
somewhere if it sounds interesting.
>     
>     The default resource usage for `THERMOS_EXECUTOR` is 0.25 so I don't encounter this
elsewhere.
>     
>     Some alternatives which I liked less:
>     1. Change `TaskExecutors.SOME_OVERHEAD_EXECUTOR`'s cpus to a value which could be
precisely represented by floating point (0.125 or something similar);
>     2. Manually add a bit more cpu resource to the `Offer` object to trick the test to
pass.
>     
>     If the above proposed solution sounds fine, I can send an update to the patch.
> 
> Bill Farner wrote:
>     I'm in favor of using an EPSILON value.  Please be sure to extract a function to
handle the equivalence check.
>     
>     It's too bad mesos doesn't use fixed point for scalar resources to avoid these issues.
 Looks like there's been recent momentum to make that change: https://issues.apache.org/jira/browse/MESOS-3997
>     
>     They recently addressed an instance of this issue with this change:
>     https://github.com/apache/mesos/commit/5b5dd566aea71f654c51b2706a958ca1b5cb07a3
>     
>     ```
>     CHECK_NEAR(result.cpus().get(), cpus().get(), MIN_CPUS);
>     ```
>     
>     For reference, `MIN_CPUS` is 0.01.

Will do with a `nearZero()` helper function.

For the value of `ESPILON`, I took a quick look at related Mesos code, and the minimum gradunalar
which could make a difference is `MIN_CPUS` / `CPU_SHARES_PER_CPU` (1024) in cgroup based
isolation, which is still larger than `1e-6`, so using the latter value should ensure Aurora
user won't lose precision even at extreme scheduling condition.


- Zhitao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42126/#review114353
-----------------------------------------------------------


On Jan. 14, 2016, 2:46 p.m., Zhitao Li wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42126/
> -----------------------------------------------------------
> 
> (Updated Jan. 14, 2016, 2:46 p.m.)
> 
> 
> Review request for Aurora, Maxim Khutornenko, Dmitriy Shirchenko, and Bill Farner.
> 
> 
> Bugs: AURORA-1109
>     https://issues.apache.org/jira/browse/AURORA-1109
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This review is a prototype for introducing multiple role support in Aurora.
> This creates a new class OfferAllocation, which allcoates resources to resources field
in TaskInfo and ExecutorInfo from an offer.
> 
> Current implementation prefers reserved resources over shared resources ('*' role) if
both are present
> 
> Several caveats:
> 1. This performs the allocate after scheduling decision in TaskAssigner.maybeAssign is
done, which leaves possibility of inconsistency and late failure.
> 
> 
> Diffs
> -----
> 
>   NEWS acaff9eb2ab184b0ef750f8b8a00c20131997f6b 
>   src/main/java/org/apache/aurora/scheduler/AcceptedOffer.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/ResourceSlot.java 7c3d681c216b78eeecebbe950186e5a79c6fe982

>   src/main/java/org/apache/aurora/scheduler/Resources.java db422a959ee7b982c2a44323de41ad75d1a40754

>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java
2255dd407cd1810c7df5baf17cfa85f79bfffeb8 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosTaskFactory.java 8fdadda67478bb3110aa442b7d78493cf9c3edb4

>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 7e8e456e288986eb0ce92a123b294e1e25d8ed18

>   src/test/java/org/apache/aurora/scheduler/AcceptedOfferTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/ResourceSlotTest.java e4ae943303823ac4bfbe999ed22f5999484462d8

>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java
33149ab415292eff04f38b61f2b1d1eac79f347a 
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosTaskFactoryImplTest.java a5793bffabf4e5d6195b1b99f2363d241c0cecf9

>   src/test/java/org/apache/aurora/scheduler/state/TaskAssignerImplTest.java 3cbe9acd75def14ae2e0986914ba621fb164b3e4

> 
> Diff: https://reviews.apache.org/r/42126/diff/
> 
> 
> Testing
> -------
> 
> 1. Unit tested with old and new tests;
> 2. vagrant integration tests: I manually separate out the vagrant box's cpu and memory
between 'aurora-test' role and '*' and verified that jobs can still be launched (I can post
the vagrant change in another follow upon request).
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message