mesos-dev mailing list archives

From Evers Benno
Subject Re: Status acknowledgements in MesosExecutor
Date Wed, 01 Jun 2016 17:27:03 GMT
Some more context about this bug:

We did some tests with a framework that does nothing but send empty
tasks and a sample executor that does nothing but send TASK_FINISHED and
shut itself down.

Running on two virtual machines on the same host (i.e. no network
involved), we see TASK_FAILED in about 3% of all tasks (271 out of
9000). Adding some megabytes of data into, this can go up
to 80%. In all cases where I looked manually, the logs look like this:
(IDs shortened to three characters for better readability)

I0502 14:40:33.151075 394179 slave.cpp:3002] Handling status update
TASK_FINISHED (UUID: 20c) for task 24c of framework f20 from
I0502 14:40:33.151175 394179 slave.cpp:3528]
executor(1)@[2a02:6b8:0:1a16::165]:49266 exited
I0502 14:40:33.151190 394179 slave.cpp:3886] Executor 'executor_24c' of
framework f20 exited with status 0
I0502 14:40:33.151216 394179 slave.cpp:3002] Handling status update
TASK_FAILED (UUID: 01b) for task 24c of framework f20 from @

The random failure chance is a bit too high to ignore, so we're
currently writing/testing a patch to wait for confirmations for all
status updates on executor shutdown.
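
To illustrate the idea, here is a minimal sketch of "wait for all
acknowledgements before exiting". It is not the patch itself and does not
touch any Mesos API: the PendingUpdates class and its method names are
hypothetical, and the acknowledgement is simulated with a timer.

```python
import threading


class PendingUpdates:
    """Tracks status updates that were sent but not yet acknowledged.

    Hypothetical helper for illustration only; not part of Mesos.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = set()

    def sent(self, uuid):
        # Record an update as sent and still awaiting acknowledgement.
        with self._cond:
            self._pending.add(uuid)

    def acknowledged(self, uuid):
        # Drop the update and wake up anyone blocked in wait_until_empty().
        with self._cond:
            self._pending.discard(uuid)
            self._cond.notify_all()

    def wait_until_empty(self, timeout=None):
        # Block until every update is acknowledged or the timeout expires.
        # Returns True if nothing is pending, False on timeout.
        with self._cond:
            return self._cond.wait_for(lambda: not self._pending, timeout)


# Simulated shutdown: send one update, let the "slave" acknowledge it
# shortly afterwards, and only proceed once nothing is pending.
pending = PendingUpdates()
pending.sent("20c")
threading.Timer(0.05, pending.acknowledged, args=("20c",)).start()
all_acked = pending.wait_until_empty(timeout=5.0)
```

The timeout matters: if the slave never acknowledges, the executor should
still exit eventually instead of hanging forever.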

It would be great if someone would like to shepherd this.

Best regards,

On 03.05.2016 14:49, Evers Benno wrote:
> Hi,
> I was wondering about the semantics of the Executor::sendStatusUpdate()
> method. It is described as
>     // Sends a status update to the framework scheduler, retrying as
>     // necessary until an acknowledgement has been received or the
>     // executor is terminated (in which case, a TASK_LOST status update
>     // will be sent). See Scheduler::statusUpdate for more information
>     // about status update acknowledgements.
> I understood this to mean that the function blocks until an
> acknowledgement is received, but looking at the implementation of
> MesosExecutor it seems that "retrying as necessary" only means
> re-sending of unacknowledged updates when the slave reconnects.
> (exec/exec.cpp:274)
> I'm wondering because we're currently running a python executor which
> ends its life like this:
>     driver.sendStatusUpdate(_create_task_status(TASK_FINISHED))
>     driver.stop()
>     # in a different thread:
>     sys.exit(0 if == mesos_pb2.DRIVER_STOPPED else 1)
> and we're seeing situations (roughly once per 10,000 tasks) where the
> executor process is reaped before the acknowledgement for TASK_FINISHED
> is sent from slave to executor. This results in Mesos generating a
> TASK_FAILED status update, probably from
> Slave::sendExecutorTerminatedStatusUpdate().
> So, did I misunderstand how MesosExecutor works? Or is it indeed a race,
> and we have to change the executor shutdown?
> Best regards,
> Benno
