mesos-user mailing list archives

From Vinod Kone <vinodk...@gmail.com>
Subject Re: Trying to get task reconciliation to work
Date Fri, 18 Apr 2014 17:46:52 GMT
If a framework asks to reconcile a task that doesn't belong to it, the
master does not respond. This is nice because it avoids leaking
information between frameworks.
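
A minimal sketch of that behavior, as a toy model and not the actual master code: `ToyMaster`, its `tasks` table, and `reconcile()` are hypothetical names made up for illustration. It also assumes that a task the master has never heard of is treated the same way as a task owned by another framework, so the two cases look identical to the caller.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    # Hypothetical stand-in for the master's task table:
    # task_id -> (owning framework_id, task state)
    tasks: dict = field(default_factory=dict)

    def reconcile(self, framework_id, task_ids):
        """Return (task_id, state) replies only for tasks owned by the caller."""
        replies = []
        for task_id in task_ids:
            entry = self.tasks.get(task_id)
            if entry is None:
                continue  # assumption: unknown task, stay silent
            owner, state = entry
            if owner != framework_id:
                continue  # another framework's task, stay silent
            replies.append((task_id, state))
        return replies


master = ToyMaster({
    "t1": ("fw-a", "TASK_RUNNING"),
    "t2": ("fw-b", "TASK_RUNNING"),
})
# fw-a asks about its own task, fw-b's task, and a task that does not exist.
print(master.reconcile("fw-a", ["t1", "t2", "t3"]))
```

Because the foreign task and the nonexistent task both produce no reply in this model, a framework cannot use reconciliation to probe for tasks that belong to someone else.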


On Fri, Apr 18, 2014 at 5:04 AM, David Greenberg <dsg123456789@gmail.com> wrote:

> Piggybacking onto this thread with a follow-up question: what happens if
> you ask the master to reconcile some tasks that weren't launched by your
> framework? Will you get messages saying those tasks are unknown or lost,
> or will you get no response at all?
>
>
> On Thursday, April 17, 2014, Sharma Podila <spodila@netflix.com> wrote:
>
>> No problem, I have a better understanding now.
>> And it was useful to see the three items you listed explicitly.
>>
>>
>> On Thu, Apr 17, 2014 at 2:39 PM, Benjamin Mahler <
>> benjamin.mahler@gmail.com> wrote:
>>
>> Good to see you were playing around with reconciliation. We should have
>> made the current semantics clearer, especially in light of the fact that
>> it's not implemented fully until one uses a strict registrar (likely
>> 0.20.0).
>>
>> Think of reconciliation as the fallback mechanism to ensure that state is
>> consistent; it's not designed to inform you of things you were already
>> told (in this case, that the tasks were running). That said, we could
>> consider sending updates even when task state remains the same.
>>
>>
>> For the purpose of this conversation, let's say we're in the 0.20.0
>> world, operating with the registrar. And let's assume your goal is to build
>> a highly available framework (I will be documenting how to do this for
>> 0.20.0):
>>
>> (1) *When you receive a status update, you must persist this information
>> before returning from the statusUpdate() callback*. Once you return from
>> the callback, the driver will acknowledge the slave directly. Slaves will
>> retry status update delivery *until* the acknowledgement is received from
>> the scheduler driver in order to ensure that the framework processed the
>> update.
>>
>> (2) *When you receive a "slave lost" signal, it means that your tasks
>> that were running on that slave are in state TASK_LOST*, and any
>> reconciliation you perform for these tasks will result in a reply of
>> TASK_LOST. Most of the time we'll deliver these TASK_LOST automatically,
>> but with a confluence of Master *and* Slave failovers, we are unaware of
>> which tasks were running on the slave as we do not persist this information
>> in the Master.
>>
>> (3) To guarantee that you have a consistent view of task states, *you
>> must also periodically reconcile task state against the Master*. This is
>> only because the delivery of the "slave lost" signal in (2) is not reliable
>> (the Master could fail over after removing a slave but before telling
>> frameworks that the slave was lost).
>>
>> You'll notice that this model forces one to serially persist all status
>> update changes. We are planning to expose mechanisms to allow "batch"
>> acknowledgement of status updates in the lower-level API that benh has
>> given talks about. With a lower-level API, it is possible to build more
>> powerful libraries that hide much of these details!
>>
>> You'll also perhaps notice that only (1) and (3) are strictly required
>> for consistency, but (2) is highly recommended as the vast majority of the
>> time the "slave lost" signal will be delivered and you can take action
>> quickly, without having to rely on periodic reconciliation.
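
Rules (1) through (3) above can be sketched as a small self-contained model. This is an illustration under stated assumptions, not the Mesos scheduler API: `ToyScheduler`, its method names, the JSON file used as the store, and the `ask_master` callable are all hypothetical. The point it shows is the ordering in (1): the update is durably written before the callback returns, since returning is what triggers the acknowledgement.

```python
import json
import os
import tempfile


class ToyScheduler:
    """Hypothetical framework scheduler that persists task state to a JSON file."""

    def __init__(self, path):
        self.path = path
        self.tasks = {}  # task_id -> {"state": ..., "slave": ...}
        if os.path.exists(path):
            with open(path) as f:
                self.tasks = json.load(f)  # recover state after a failover

    def _persist(self):
        # Write to a temp file, fsync, then atomically swap it into place.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.tasks, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def status_update(self, task_id, state, slave_id):
        # Rule (1): persist BEFORE returning; returning stands in for the ack.
        self.tasks[task_id] = {"state": state, "slave": slave_id}
        self._persist()

    def slave_lost(self, slave_id):
        # Rule (2): every task we had on that slave is now TASK_LOST.
        for task in self.tasks.values():
            if task["slave"] == slave_id:
                task["state"] = "TASK_LOST"
        self._persist()

    def reconcile_once(self, ask_master):
        # Rule (3): call this periodically. ask_master is a hypothetical
        # callable that takes [(task_id, state)] and returns corrections.
        known = [(tid, t["state"]) for tid, t in self.tasks.items()]
        for task_id, state in ask_master(known):
            if task_id in self.tasks:
                self.tasks[task_id]["state"] = state
        self._persist()


store = os.path.join(tempfile.mkdtemp(), "tasks.json")
sched = ToyScheduler(store)
sched.status_update("t1", "TASK_RUNNING", "s1")
sched.slave_lost("s1")
print(ToyScheduler(store).tasks)  # a restarted scheduler sees the persisted state
```

Persisting on every update is what makes this model serial, which is the cost the batch acknowledgement idea mentioned above is meant to reduce.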
>>
>> Please let me know if anything here was not clear!
>>
>>
>> On Thu, Apr 17, 2014 at 1:47 PM, Sharma Podila <spodila@netflix.com> wrote:
>>
>> Should've looked at the code before sending the previous email...
>> master/main.cpp confirmed what I needed to know. It doesn't look like I
>> will be able to use reconcileTasks the way I thought I could. Effectively,
>> a lack of callback could mean either that the master agrees with the
>> task state given in the reconcile request, or that the task and/or slave
>> is currently unknown, which makes it an unreliable source of data. I
>> understand this is expected to improve later by leveraging the registrar,
>> but I suspect there's more to it.
>>
>> I take it then that individual frameworks need to have their own
>> mechanisms to ascertain the state of their tasks.
>>
>>
>> On Thu, Apr 17, 2014 at 12:53 PM, Sharma Podila <spodila@netflix.com> wrote:
>>
>> Hello
>>
>>
