incubator-mesos-dev mailing list archives

From David Greenberg <>
Subject Re: Question about TASK_LOST statuses
Date Sat, 18 May 2013 19:40:56 GMT
I am looking at the slave's logs, and here's what I see:
- 81 instances of "Telling slave of lost executor XXX of framework YYY"
- 500,000+ instances of "Failed to collect resource usage for executor XXX
of framework YYY"
- 8 instances of "WARNING! executor XXX of framework YYY should be shutting

On the master's logs, I see this:
- 5600+ instances of "Error validating task XXX: Task uses invalid slave:

What do you think the problem is? I am copying the slave_id from the offer
into the TaskInfo protobuf.

I'm using the process-based isolation at the moment (I haven't had the time
to set up the cgroups isolation yet).

I can find and share whatever else is needed so that we can figure out why
these messages are occurring.


On Fri, May 17, 2013 at 5:16 PM, Vinod Kone <> wrote:

> Hi David,
> You are right in that all these status updates are what we call "terminal"
> status updates and mesos takes specific actions when it gets/generates one
> of these.
> TASK_LOST is special in the sense that it is not generated by the executor,
> but by the slave/master. You could think of it as an exception in mesos.
> Clearly, these should be rare in a stable mesos system.
> What do your logs say about the TASK_LOSTs? Is it always the same issue?
> Are you running w/ cgroups?
> On Fri, May 17, 2013 at 2:04 PM, David Greenberg <> wrote:
> > Hello! Today I began working on a more advanced version of mesos-submit
> > that will handle hot-spares.
> >
> > I was assuming that TASK_{FAILED,FINISHED,LOST,KILLED} were the status
> > updates that meant that I needed to start a new spare process, as the
> > monitored task was killed. However, I noticed that I often received
> > TASK_LOSTs, and every 5 seconds, my scheduler would think its tasks had all
> > died, so it'd restart too many. Nevertheless, the tasks would reappear
> > later on, and I could see them in the web interface of Mesos, continuing to
> > run.
> >
> > What is going on?
> >
> > Thanks!
> > David
> >
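
The hot-spare handling discussed in this thread can be sketched roughly as below. This is only an illustration under stated assumptions, not the mesos-submit code and not the Mesos API: the SpareTracker class and its methods are hypothetical, and plain strings stand in for the task state values named above. It treats TASK_{FAILED,FINISHED,LOST,KILLED} as terminal and requests at most one replacement per task, so that a repeated TASK_LOST for the same task does not start extra spares. It does not explain why the TASK_LOSTs occur in the first place.

```python
# Hypothetical sketch: SpareTracker and its methods are illustrative names,
# not part of Mesos. Strings stand in for the real task state values.
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


class SpareTracker:
    """Tracks monitored tasks and requests one replacement per terminated task."""

    def __init__(self):
        self.running = set()     # task ids currently believed to be running
        self.replacements = []   # task ids for which a spare was requested

    def task_started(self, task_id):
        self.running.add(task_id)

    def status_update(self, task_id, state):
        """Return True if this update triggered a replacement."""
        if state not in TERMINAL_STATES:
            return False
        if task_id not in self.running:
            # Repeated or unknown terminal update: nothing left to replace.
            return False
        self.running.discard(task_id)
        self.replacements.append(task_id)
        return True


tracker = SpareTracker()
tracker.task_started("task-1")
print(tracker.status_update("task-1", "TASK_RUNNING"))  # non-terminal update
print(tracker.status_update("task-1", "TASK_LOST"))     # first terminal update
print(tracker.status_update("task-1", "TASK_LOST"))     # duplicate, ignored
```

Keying the decision on the task id rather than on each incoming update is the design choice here: the scheduler reacts once per task, however many terminal updates arrive for it.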
