mesos-dev mailing list archives

From Benjamin Mahler <benjamin.mah...@gmail.com>
Subject Re: Trying to get task reconciliation to work
Date Fri, 18 Apr 2014 17:55:49 GMT
Vinod, David is asking about tasks that "belong" to the framework in that
they were "launched" by it, in which case your answer is not correct. We
don't keep track of tasks so we don't know whether the task "belongs" to
the framework in this sense.

David, you will either receive TASK_LOST or nothing (if the slave for
the task is in a transient state).

This is determined more by the SlaveID than by the TaskID, as the Master
does not persistently track tasks.

(a) If you're asking about an unknown slave, you will get TASK_LOST.
(b) If you're asking about a known slave and an unknown task, you will get
TASK_LOST.
(c) If you're asking about a known slave and a known task with a different
state, you will be sent the latest state.
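
To make the three cases concrete, here is a toy model in Python of how the
reply is decided. It is only a sketch of the semantics described in this
thread, not the Master's actual code: `reconcile_reply`, the `known_slaves`
mapping and the `transient_slaves` set are names made up for illustration.

```python
from typing import Dict, Optional, Set

TASK_LOST = "TASK_LOST"


def reconcile_reply(
    slave_id: str,
    task_id: str,
    framework_state: str,
    known_slaves: Dict[str, Dict[str, str]],
    transient_slaves: Set[str],
) -> Optional[str]:
    """Return the state the framework would be sent, or None for no reply.

    known_slaves maps slave id -> {task id -> latest known task state}.
    Every name here is hypothetical; the function only mirrors cases (a)-(c).
    """
    if slave_id in transient_slaves:
        return None  # slave is in a transient state: nothing is sent
    if slave_id not in known_slaves:
        return TASK_LOST  # (a) unknown slave
    tasks = known_slaves[slave_id]
    if task_id not in tasks:
        return TASK_LOST  # (b) known slave, unknown task
    if tasks[task_id] != framework_state:
        return tasks[task_id]  # (c) known task, different state
    return None  # states already agree: nothing is sent
```

Note that a `None` result is ambiguous from the framework's point of view: it
can mean either "the Master agrees" or "the slave is in a transient state".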

If you consider these semantics, you'll realize that you may receive
TASK_LOST if you try to reconcile a task that finished correctly. This
is why I mentioned the need to persist updates in (1) of my earlier message
(quoted below). Let's say you receive a terminal update of TASK_FINISHED and
then you still try to reconcile against a failed-over Master. This new Master
will reply with TASK_LOST because it is unaware of the task/slave. So, you
will always receive your valid terminal update before getting a TASK_LOST
from reconciliation.
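
As a rough sketch of the framework side, the following Python shows one way
to keep a persisted terminal update from being overwritten by a later
TASK_LOST from reconciliation. `TaskStore` and its methods are hypothetical,
the in-memory dict merely stands in for durable storage, and skipping
reconciliation for already-terminal tasks is an assumption of this sketch
rather than something the thread prescribes.

```python
from typing import Dict, List, Optional

# Terminal task states; once one is persisted, later updates are ignored.
TERMINAL_STATES = frozenset(
    {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}
)


class TaskStore:
    """Hypothetical store; a real framework would write to durable storage."""

    def __init__(self) -> None:
        self._states: Dict[str, str] = {}

    def apply_update(self, task_id: str, state: str) -> bool:
        """Record a status update. Returns False if the update was ignored."""
        if self._states.get(task_id) in TERMINAL_STATES:
            # A terminal update was already persisted, so a later
            # TASK_LOST from reconciliation must not overwrite it.
            return False
        self._states[task_id] = state
        return True

    def state(self, task_id: str) -> Optional[str]:
        """Return the last recorded state, or None for an unknown task."""
        return self._states.get(task_id)

    def tasks_to_reconcile(self) -> List[str]:
        """List the tasks whose recorded state is not yet terminal."""
        return sorted(
            t for t, s in self._states.items() if s not in TERMINAL_STATES
        )
```

In a scheduler, `apply_update` would be called (and its write completed)
before returning from the statusUpdate() callback, so that the update is
persisted before it is acknowledged.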


On Fri, Apr 18, 2014 at 10:46 AM, Vinod Kone <vinodkone@gmail.com> wrote:

> If a framework asks to reconcile a task that doesn't belong to it there
> would be no response from the master. This is nice because it avoids
> information leak between frameworks.
>
>
> On Fri, Apr 18, 2014 at 5:04 AM, David Greenberg <dsg123456789@gmail.com> wrote:
>
> > Piggybacking onto this thread with a follow up question: what happens if
> > you ask the master to reconcile some tasks that weren't launched by your
> > framework? Will you get messages that express those tasks were unknown,
> > lost, or will nothing respond?
> >
> >
> > On Thursday, April 17, 2014, Sharma Podila <spodila@netflix.com> wrote:
> >
> >> No problem, I have a better understanding now.
> >> And it was useful to see the three items you listed explicitly.
> >>
> >>
> >> On Thu, Apr 17, 2014 at 2:39 PM, Benjamin Mahler <benjamin.mahler@gmail.com> wrote:
> >>
> >> Good to see you were playing around with reconciliation, we should have
> >> made the current semantics more clear. Especially in light of the fact that
> >> it's not implemented fully until one uses a strict registrar (likely
> >> 0.20.0).
> >>
> >> Think of reconciliation as the fallback mechanism to ensure that state is
> >> consistent, it's not designed to be something to inform you of things you
> >> were already told (in this case, that the tasks were running). Although we
> >> could consider sending updates even when task state remains the same.
> >>
> >>
> >> For the purpose of this conversation, let's say we're in the 0.20.0
> >> world, operating with the registrar. And let's assume your goal is to build
> >> a highly available framework (I will be documenting how to do this for
> >> 0.20.0):
> >>
> >> (1) *When you receive a status update, you must persist this information
> >> before returning from the statusUpdate() callback*. Once you return from
> >> the callback, the driver will acknowledge the slave directly. Slaves will
> >> retry status update delivery *until* the acknowledgement is received from
> >> the scheduler driver in order to ensure that the framework processed the
> >> update.
> >>
> >> (2) *When you receive a "slave lost" signal, it means that your tasks
> >> that were running on that slave are in state TASK_LOST*, and any
> >> reconciliation you perform for these tasks will result in a reply of
> >> TASK_LOST. Most of the time we'll deliver these TASK_LOST automatically,
> >> but with a confluence of Master *and* Slave failovers, we are unaware of
> >> which tasks were running on the slave as we do not persist this information
> >> in the Master.
> >>
> >> (3) To guarantee that you have a consistent view of task states, *you
> >> must also periodically reconcile task state against the Master*. This is
> >> only because the delivery of the "slave lost" signal in (2) is not reliable
> >> (the Master could fail over after removing a slave but before telling
> >> frameworks that the slave was lost).
> >>
> >> You'll notice that this model forces one to serially persist all status
> >> update changes. We are planning to expose mechanisms to allow "batch"
> >> acknowledgement of status updates in the lower-level API that benh has
> >> given talks about. With a lower-level API, it is possible to build more
> >> powerful libraries that hide much of these details!
> >>
> >> You'll also perhaps notice that only (1) and (3) are strictly required
> >> for consistency, but (2) is highly recommended as the vast majority of the
> >> time the "slave lost" signal will be delivered and you can take action
> >> quickly, without having to rely on periodic reconciliation.
> >>
> >> Please let me know if anything here was not clear!
> >>
> >>
> >> On Thu, Apr 17, 2014 at 1:47 PM, Sharma Podila <spodila@netflix.com> wrote:
> >>
> >> Should've looked at the code before sending the previous email...
> >> master/main.cpp confirmed what I needed to know. It doesn't look like I
> >> will be able to use reconcileTasks the way I thought I could. Effectively,
> >> a lack of callback could either mean that the master agrees with the
> >> requested reconcile task state, or that the task and/or slave is currently
> >> unknown. Which makes it an unreliable source of data. I understand this is
> >> expected to improve later by leveraging the registrar, but, I suspect
> >> there's more to it.
> >>
> >> I take it then that individual frameworks need to have their own
> >> mechanisms to ascertain the state of their tasks.
> >>
> >>
> >> On Thu, Apr 17, 2014 at 12:53 PM, Sharma Podila <spodila@netflix.com> wrote:
> >>
> >> Hello
> >>
> >>
>
