mesos-dev mailing list archives

From Vinod Kone <vinodk...@apache.org>
Subject Re: Agent reregistration timeout, no TASK_LOST messages
Date Tue, 18 Jul 2017 00:14:33 GMT
On Mon, Jul 17, 2017 at 2:55 PM, Meghdoot bhattacharya <
meghdoot_b@yahoo.com.invalid> wrote:

> When there is no master fail over and agents join back after the default
> 5*15 timeout, we do see tasks getting killed like it used to. Because in
> this case master has sent task lost to framework.
> But we are noticing shutdown() executor callback not getting invoked. We
> started a different thread on it. This is mesos 1.1.
>
> Are you trying to say tasks will leak in latest versions and again relies
> on recon for the regular health check timeout scenario and agent joining
> back?
>

There should be no task leaks. Since the partition-awareness code landed,
the master no longer shuts down agents in the above scenario, but it
still shuts down the tasks/executors of non-partition-aware frameworks.
So the observable behavior for a framework regarding its tasks/executors
should not change. The one observable change is that frameworks do not get
`LostSlaveMessage` (the `lostSlave()` callback on the driver) in this case.
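
To make the rule concrete, here is a rough toy model in Python. It is not
Mesos code: the names `Framework`, `Task`, `ReregistrationOutcome` and
`on_late_reregistration` are hypothetical and exist only for illustration.
It also assumes that tasks of partition-aware frameworks are left running,
which is implied above but not spelled out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Framework:
    name: str
    partition_aware: bool  # hypothetical flag standing in for the capability


@dataclass
class Task:
    task_id: str
    framework: Framework


@dataclass
class ReregistrationOutcome:
    agent_shut_down: bool
    killed_tasks: List[str]
    kept_tasks: List[str]
    lost_slave_sent: bool


def on_late_reregistration(tasks: List[Task]) -> ReregistrationOutcome:
    """Toy model of an agent re-registering after the health check timeout,
    with no master failover in between."""
    # Tasks/executors of non-partition-aware frameworks are still shut down.
    killed = [t.task_id for t in tasks if not t.framework.partition_aware]
    # Assumption: tasks of partition-aware frameworks keep running.
    kept = [t.task_id for t in tasks if t.framework.partition_aware]
    return ReregistrationOutcome(
        agent_shut_down=False,   # the agent itself is no longer shut down
        killed_tasks=killed,
        kept_tasks=kept,
        lost_slave_sent=False,   # no LostSlaveMessage / lostSlave() callback
    )
```

For an agent running one task from each kind of framework, the model kills
only the non-partition-aware framework's task, keeps the agent up, and
sends no `LostSlaveMessage`.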
