mesos-issues mailing list archives

From "Stephan Erb (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
Date Sat, 07 May 2016 17:56:12 GMT

    [ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275326#comment-15275326 ]

Stephan Erb commented on MESOS-5332:
------------------------------------

My analysis of the events is slightly different and leads to a different conclusion:

1. Executor and agent have been running for several days. No events are sent from the executors
to the agent during that time, which leads to some of the persistent connections from executors
to the agent silently becoming stale.
2. The agent is restarted. Executors with working TCP connections detect this; those with
stale connections, however, do not.
3. The agent process sends a {{ReconnectExecutorMessage}}.
4. The executors try to send an {{ExecutorReregisterMessage}} as a reply. Due to the nature
of TCP, the executors with stale connections only learn on this write attempt that their
persistent connections have gone bad, i.e. they generate the {{ExitedEvent}} only now (see
the sketch below).
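
To illustrate point 4, here is a minimal, self-contained sketch of the underlying TCP behaviour
(plain POSIX sockets, not Mesos/libprocess code): a peer that has gone away is not reported as
an error when it disappears, but only once we actually use the connection, and typically only
on a later operation. The abrupt {{close()}} below is merely a stand-in for whatever caused the
real connections to become stale.

{code}
// Sketch only: error handling omitted for brevity. Compile with g++ -std=c++11.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cerrno>
#include <cstdio>
#include <cstring>

int main() {
  // Listener on an ephemeral loopback port.
  int listener = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = 0;
  bind(listener, (sockaddr*) &addr, sizeof(addr));
  listen(listener, 1);

  socklen_t len = sizeof(addr);
  getsockname(listener, (sockaddr*) &addr, &len);

  // Connect a client, accept the peer, then make the peer go away.
  int client = socket(AF_INET, SOCK_STREAM, 0);
  connect(client, (sockaddr*) &addr, sizeof(addr));
  int peer = accept(listener, nullptr, nullptr);
  close(peer);  // The client gets no error notification at this point.

  // The first send is accepted locally (the kernel just buffers it),
  // even though the peer is already gone.
  ssize_t n = send(client, "x", 1, MSG_NOSIGNAL);
  printf("first send:  %zd (%s)\n", n, n < 0 ? strerror(errno) : "ok");

  sleep(1);  // Give the peer's RST time to come back.

  // Only now does the failure surface -- analogous to the executor
  // noticing the stale connection only when replying to the agent.
  n = send(client, "x", 1, MSG_NOSIGNAL);
  printf("second send: %zd (%s)\n", n, n < 0 ? strerror(errno) : "ok");

  close(client);
  close(listener);
  return 0;
}
{code}

On Linux this prints a successful first send and {{EPIPE}} ("Broken pipe") for the second one,
even though the peer was gone before either of them.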

Proposed fix: We have to discover that the persistent connections have become stale before
we actually need them. We can achieve this by periodically sending data over the socket. TCP
will then discover that the connection has gone bad. There seem to be two approaches here:

a) We enable layer 4 connection health checking by using TCP keepalives (via {{SO_KEEPALIVE}})
on all sockets.
b) We make use of layer 7 connection health checking by periodically sending a no-op / whitespace
over the persistent HTTP connection.
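
For (a), this is a minimal sketch of what enabling keepalives could look like on a raw Linux
socket fd (not actual libprocess code; the timer values are purely illustrative):

{code}
// Sketch only: enables TCP keepalives on an already connected socket and
// tunes the Linux probe timers so a dead peer is detected within about a
// minute instead of after the default idle time of two hours.
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#include <cerrno>
#include <cstring>
#include <string>

// Returns an error message on failure, or an empty string on success.
std::string enableKeepalive(int fd)
{
  const int enable = 1;     // Turn on layer 4 keepalive probing.
  const int idle = 30;      // Seconds of idleness before the first probe.
  const int interval = 10;  // Seconds between unanswered probes.
  const int count = 3;      // Unanswered probes before the socket errors out.

  if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &enable, sizeof(enable)) != 0 ||
      setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) != 0 ||
      setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)) != 0 ||
      setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) != 0) {
    return std::string("setsockopt failed: ") + strerror(errno);
  }

  return "";
}
{code}

With values like these, a silently stale connection would be torn down after roughly
30 + 3 * 10 = 60 seconds of unanswered probes, so the {{ExitedEvent}} would fire long before
the next agent restart instead of racing against the reregistration timeout. Option (b) would
achieve the same effect one layer up, without depending on kernel-level settings.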

FWIW, this recent SIGPIPE commit might have already improved the closed-socket detection: https://github.com/apache/mesos/commit/b0dcf4183a9125d0346688fa4f413658508aae68

> TASK_LOST on slave restart potentially due to executor race condition
> ---------------------------------------------------------------------
>
>                 Key: MESOS-5332
>                 URL: https://issues.apache.org/jira/browse/MESOS-5332
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess, slave
>    Affects Versions: 0.26.0
>         Environment: Mesos 0.26
> Aurora 0.13
>            Reporter: Stephan Erb
>         Attachments: executor-stderr.log, executor-stderrV2.log, mesos-slave.log
>
>
> When restarting the Mesos agent binary, tasks can end up as LOST. We lose from 20% to 50% of all tasks. They are killed by the Mesos agent via:
> {code}
> I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered executors
> I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-28854-0-6a88d62e-6564-4e33-b0bb-1d8039d97afc' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:40541
> I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-23839-0-1d2cd0e6-6994-4cba-a9df-3dfc1552667f' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:35757
> I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8-af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:51463
> ...
> I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
> {code}
> We have verified that the tasks and their executors are killed by the agent during startup. When stopping the agent using {{supervisorctl stop}}, the executors are still running (verified via {{ps aux}}). They are only killed once the agent tries to reregister.
> The issue is hard to reproduce:
> * When restarting the agent binary multiple times, tasks are only lost for the first restart.
> * It is much more likely to occur if the agent binary has been running for a longer period of time (> 7 days).
> * It tends to be more likely if the host has many cores (30-40) and thus many libprocess workers.
> Mesos correctly sticks to the 2-second wait time before killing un-reregistered executors. The failed executors receive the reregistration request, but it seems like they fail to send a reply.
> A successful reregistration (not leading to LOST):
> {code}
> I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with slave 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from slave 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 1.492339ms
> {code}
> A failed one:
> {code}
> I0505 08:42:04.779677  2389 exec.cpp:256] Received reconnect request from slave 20160118-141153-92471562-5050-6270-S17
> E0505 08:42:05.481374  2408 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
> I0505 08:42:05.481374  2395 exec.cpp:456] Slave exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with slave 20160118-141153-92471562-5050-6270-S17
> {code}
> All tasks ending up in LOST have output similar to the one posted above, i.e. the messages seem to be received in the wrong order.
> Does anyone have an idea what might be going on here?



