mesos-dev mailing list archives

From Benjamin Mahler <benjamin.mah...@gmail.com>
Subject Re: A problem with resource offers
Date Fri, 07 Nov 2014 02:53:13 GMT
Which version of the master are you using and do you have the logs? The
fact that no offers were coming back sounds like a bug!

As for using O1 after a disconnection, all offers are invalid once a
disconnection occurs. The scheduler driver does not automatically rescind
offers upon disconnection, so I'd recommend clearing all cached offers when
your scheduler gets disconnected, to avoid the unnecessary TASK_LOST
updates.
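
A minimal sketch of that advice, not taken from any real framework: the
callback names (resourceOffers, offerRescinded, disconnected, reregistered)
mirror the Mesos Scheduler interface, but OfferCache, CachingScheduler and
the (offer_id, slave_id) tuples standing in for Offer messages are
hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class OfferCache:
    """Hypothetical framework-side cache of unused offers."""
    offers: dict = field(default_factory=dict)  # offer_id -> slave_id

    def add(self, offer_id, slave_id):
        self.offers[offer_id] = slave_id

    def remove(self, offer_id):
        # Ignore offers we no longer hold.
        self.offers.pop(offer_id, None)

    def for_slave(self, slave_id):
        return sorted(o for o, s in self.offers.items() if s == slave_id)

    def clear(self):
        self.offers.clear()


class CachingScheduler:
    """Only the callbacks relevant to offer bookkeeping are shown."""

    def __init__(self):
        self.cache = OfferCache()

    def resourceOffers(self, driver, offers):
        # Assumption: each offer is a plain (offer_id, slave_id) tuple.
        for offer_id, slave_id in offers:
            self.cache.add(offer_id, slave_id)

    def offerRescinded(self, driver, offer_id):
        self.cache.remove(offer_id)

    def disconnected(self, driver):
        # Every offer held at this point is invalid; drop them all.
        self.cache.clear()

    def reregistered(self, driver, master_info):
        # Defensive: also start from an empty cache after reregistration.
        self.cache.clear()
```

With this in place, the sequence O1, disconnect, reregister, O2 leaves only
O2 cached for slave A, so a later launch would not mix in the stale O1.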

On Thu, Nov 6, 2014 at 6:25 PM, Sharma Podila <spodila@netflix.com> wrote:

> We had an interesting problem with resource offers today and I would like
> to confirm this problem and request an enhancement. Here's a summary of
> the events, in the order they occurred:
>
> 1. resource offer O1 for slave A arrives
> 2. mesos disconnects
> 3. mesos reregisters
> 4. mesos offer O2 for slave A arrives
>     (our framework keeps unused offers for some time, so we now
> incorrectly hold both O1 and O2)
> 5. launch task T1 using offers O1 and O2
> 6. framework thinks it now holds no offers for slave A, and will wait
> for a new offer after mesos consumes resources for task T1
> 7. mesos sends TASK_LOST for T1 saying it was using an invalid offer
>     (even though only O1 was invalid, O2 is gone missing silently)
> 8. no more offers come for slave A
> 9. basically we have an offer leak problem.
>
> To work around this, I am changing my framework so that when it receives
> the mesos reregistration callback (step 3 above), it removes all existing
> offers. This should fix the problem.
>
> However, I am wondering if #7 can be improved in Mesos. When a task (or a
> set of tasks) is launched using multiple offers, if at least one of the
> offers is invalid, then Mesos should treat all offers as given up by the
> framework. This will still send TASK_LOST to the framework, but will also
> make the valid offers' resources available again through new offers.
>
> I am thinking this will be critical to do when Mesos starts rescinding
> offers, because in that case frameworks cannot rely on a strategy
> like the one I am using with reregistration.
>
> Sharma
>
>
