aurora-dev mailing list archives

From Bill Farner <wfar...@apache.org>
Subject Re: Aurora grabbing resources even when not scheduling
Date Thu, 01 May 2014 04:37:46 GMT
Aurora holds offers for a few reasons:
- To avoid blocking the Mesos driver callback thread while matching offers
to pending tasks
- To enable preemption (determining whether cluster resources are exhausted
and a low-priority task should be evicted for a high-priority one)
- To make better scheduling decisions (choosing the best offer based
on things like failure domains)
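
As a rough illustration of the first and third points, here is a minimal sketch of an offer buffer. It is not Aurora's code: the HeldOffers class, its method names, and the "prefer an unused host, then the tightest CPU fit" rule are all assumptions made up for this example. The callback thread only records offers, and a separate thread picks among the held offers later.

```python
import threading
from collections import OrderedDict


class HeldOffers:
    """Hypothetical buffer of held offers (illustration only, not Aurora code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._offers = OrderedDict()  # offer_id -> (host, cpus)

    def add(self, offer_id, host, cpus):
        # Meant to be called from the driver callback thread: it only records
        # the offer, so the callback returns without doing any matching.
        with self._lock:
            self._offers[offer_id] = (host, cpus)

    def take_best(self, cpus_needed, hosts_in_use):
        # Meant to be called from a separate scheduling thread. Prefers a host
        # not already in use (a stand-in for failure-domain awareness), then
        # the offer that leaves the least CPU unused.
        with self._lock:
            candidates = [
                (host in hosts_in_use, cpus - cpus_needed, offer_id)
                for offer_id, (host, cpus) in self._offers.items()
                if cpus >= cpus_needed
            ]
            if not candidates:
                # Nothing fits; a real scheduler might consider preemption here.
                return None
            _, _, best = min(candidates)
            del self._offers[best]
            return best
```

Holding several offers at once is what makes the comparison in take_best possible; a scheduler that accepted or declined each offer as it arrived could not choose between them.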

As clusters grow very large in terms of slaves and tasks, these features
become necessary for the scheduler to remain responsive and predictable.

As you experienced, this has the effect of starving other frameworks that
share the cluster.  The Mesos team plans to implement offer revocation [1]
to mitigate this.  In the meantime, as a workaround, you can tune how long
Aurora holds offers with the min_offer_hold_time [2] command line argument, e.g.:

-min_offer_hold_time=1secs


Unfortunately, this value is combined with a hard-coded jitter window [3],
so offers can still be held for up to roughly one minute even with a small
minimum.  If this presents an issue, we'd happily accept a patch to make the
jitter window tunable as well!
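
To make the interaction concrete, here is a minimal sketch of how a minimum hold time and a jitter window might combine. It assumes the hold time for each offer is the configured minimum plus a uniformly random delay within the window; the 60-second default and the function name are illustrative assumptions based on the one-minute bound described above, not a copy of the scheduler's code.

```python
import random

# Assumption: illustrative jitter window matching the "one minute" bound
# described above. Not taken from the Aurora source.
JITTER_WINDOW_SECS = 60.0


def offer_hold_secs(min_hold_secs, jitter_window_secs=JITTER_WINDOW_SECS, rng=random):
    """Return a hypothetical hold time: the minimum plus a random jitter."""
    if min_hold_secs < 0 or jitter_window_secs < 0:
        raise ValueError("durations must be non-negative")
    # The jitter spreads out the times at which held offers are returned.
    return min_hold_secs + rng.uniform(0.0, jitter_window_secs)
```

Under these assumptions, lowering the minimum to one second shortens the best case but leaves the worst case dominated by the jitter window, which is why a tunable window would help.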


-=Bill

[1] https://issues.apache.org/jira/browse/MESOS-354
[2] https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java#L95-98
[3] https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java#L323

On Wed, Apr 30, 2014 at 3:19 PM, mohit soni <mohitsoni1989@gmail.com> wrote:

> We observed that Aurora's CPU share in the Mesos Master dashboard spikes up
> to 100%, even when Aurora is not running any job. Looking at the code, I
> figured that the scheduler holds on to the resourceOffers, even if there are
> no tasks to be scheduled, and doesn't decline the offers immediately.
>
> It looks like an optimization, where the TaskLaunchers are kept primed with
> resourceOffers, so that the Job can be run as soon as it's scheduled (if
> task requirements are satisfied).
>
> But, this leads to an offer starvation problem for other peer frameworks
> who tend to decline offers if they don't have tasks to be scheduled (for
> the timeout period).
>
> How can we handle this in scenarios where Aurora is running with other peer
> frameworks?
>
> Thanks
> Mohit
>
