mesos-dev mailing list archives

From Yan Xu <...@jxu.me>
Subject Re: Agent reregistration timeout, no TASK_LOST messages
Date Mon, 17 Jul 2017 22:13:26 GMT
On Mon, Jul 17, 2017 at 9:34 AM, Neil Conway <neil.conway@gmail.com> wrote:

> On Mon, Jul 17, 2017 at 9:20 AM, Ilya Pronin <ipronin@twopensource.com>
> wrote:
>
> > AFAIK the absence of TASK_LOST statuses is expected. Master registry
> > persists information only about agents. Tasks are recovered from
> > re-registering agents. Because of that the failed over master can't send
> > TASK_LOST for tasks that were running on the agent that didn't
> > re-register, it simply doesn't know about them. The only thing the master
> > can do in this situation is send LostSlaveMessage that will tell the
> > scheduler that tasks on this agent are LOST/UNREACHABLE.
> >
>
> +1.
>
> > The situation where the agent came back after reregistration timeout
> > doesn't sound good. The only way for the framework to learn about tasks
> > that are still running on such agent is either from status updates or via
> > implicit reconciliation. Perhaps, the master could send updates for tasks
> > it learned about when such agent is readmitted?
> >
>
> I agree this would be a good idea:
> https://issues.apache.org/jira/browse/MESOS-6406
>
> I haven't had a chance to implement it yet, but if someone is interested, I
> think this would be a pretty nicely scoped project.
>

The master should probably send updates about tasks of non-partition-aware
frameworks as well, especially in light of MESOS-7215, for which we are
going to stop killing tasks in all cases.
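
Until the master sends such updates itself, a framework can ask for them
through the implicit reconciliation mentioned above. A minimal sketch,
assuming the framework uses the v1 scheduler HTTP API, where a RECONCILE
call with an empty task list asks the master for the latest state of every
task it knows for that framework. The helper name reconcile_call and the
framework ID below are hypothetical; sending the body (a POST to
/api/v1/scheduler with the subscription's Mesos-Stream-Id header) is left
out.

```python
import json


def reconcile_call(framework_id, tasks=()):
    """Build the JSON body of a v1 scheduler API RECONCILE call.

    tasks is an iterable of (task_id, agent_id) pairs. Leaving it empty
    requests implicit reconciliation, i.e. updates for all tasks the
    master currently knows about for this framework.
    """
    return {
        "framework_id": {"value": framework_id},
        "type": "RECONCILE",
        "reconcile": {
            "tasks": [
                {"task_id": {"value": t}, "agent_id": {"value": a}}
                for t, a in tasks
            ]
        },
    }


if __name__ == "__main__":
    # "example-framework-id" is a placeholder, not a real framework ID.
    print(json.dumps(reconcile_call("example-framework-id"), indent=2))
```

Note that this only helps once the agent has been readmitted: a failed-over
master still cannot report on tasks of an agent that never re-registered.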


>
> Neil
>
