mesos-issues mailing list archives

From "Stephan Erb (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-7057) Consider using the relink functionality of libprocess in the executor driver.
Date Tue, 21 Feb 2017 19:54:45 GMT

    [ https://issues.apache.org/jira/browse/MESOS-7057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15876592#comment-15876592 ]

Stephan Erb commented on MESOS-7057:
------------------------------------

Thanks for fixing this! :-)

> Consider using the relink functionality of libprocess in the executor driver.
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7057
>                 URL: https://issues.apache.org/jira/browse/MESOS-7057
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.2, 1.1.0
>            Reporter: Anand Mazumdar
>            Assignee: Anand Mazumdar
>              Labels: mesosphere
>             Fix For: 1.2.0
>
>
> As outlined in the root cause analysis for MESOS-5332, an iptables firewall can
> terminate an idle connection after a timeout (the default is 5 days). Once this
> happens, the executor driver is not notified of the disconnection and keeps
> assuming that it is still connected to the agent.
> When the agent process is restarted, the executor still tries to reuse the old,
> broken connection to send the re-register message to the agent. Only then does it
> realize that the connection is broken (due to the nature of TCP): it calls the
> {{exited}} callback and terminates itself 15 minutes later, when the recovery
> timeout expires.
> To offset this, an executor should always {{relink}} when it receives a reconnect
> request from the agent.
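
The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: it does not use libprocess or the Mesos executor driver, and every name in it (Connection, Agent, ExecutorDriver, onReconnectRequest, simulateIdleTimeout) is hypothetical. It only shows why reusing a silently dropped connection loses the re-register message, while relinking first does not.

```cpp
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for a TCP connection: a firewall can drop it silently, and
// the sender only finds out the next time it tries to use it.
struct Connection {
  bool silentlyDropped = false;
};

// Toy stand-in for the agent: records the messages that actually arrive.
struct Agent {
  std::vector<std::string> received;
};

// Hypothetical executor-side driver (an assumption for illustration,
// not the real Mesos API).
class ExecutorDriver {
public:
  explicit ExecutorDriver(Agent* agent) : agent_(agent) { relink(); }

  // Discard whatever connection exists and open a fresh one.
  void relink() { connection_ = std::make_shared<Connection>(); }

  // Returns true if the message reached the agent.
  bool send(const std::string& message) {
    if (connection_->silentlyDropped) {
      return false;  // The broken link is discovered only at send time.
    }
    agent_->received.push_back(message);
    return true;
  }

  // Handler for the agent's reconnect request. With relinkFirst set,
  // the old connection is never trusted.
  bool onReconnectRequest(bool relinkFirst) {
    if (relinkFirst) {
      relink();
    }
    return send("re-register");
  }

  // Simulates the firewall expiring the idle connection.
  void simulateIdleTimeout() { connection_->silentlyDropped = true; }

private:
  Agent* agent_;
  std::shared_ptr<Connection> connection_;
};
```

In this model, calling onReconnectRequest(false) after simulateIdleTimeout() returns false and the agent receives nothing, whereas onReconnectRequest(true) delivers the message over a fresh connection.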



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
