mesos-issues mailing list archives

From "Anand Mazumdar (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-5361) Consider introducing TCP KeepAlive for Libprocess sockets.
Date Sat, 14 May 2016 16:14:12 GMT

     [ https://issues.apache.org/jira/browse/MESOS-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anand Mazumdar updated MESOS-5361:
----------------------------------
    Description: 
We currently don't use TCP keepalives when creating sockets in libprocess. Enabling them might
benefit master - scheduler and master - agent connections, i.e., we could detect failures on
either side faster.

Currently, if the master process goes down and for some reason the {{RST}} sequence does not
reach the scheduler, the scheduler only learns about the disconnection when it next tries to
do a {{send}} itself.

The default TCP keepalive values on Linux are of little use in a real-world application:
{code}
This means that the keepalive routines wait for two hours (7200 secs) before sending the
first keepalive probe, and then resend it every 75 seconds. If no ACK response is received
for nine consecutive times, the connection is marked as broken.
{code}

However, even with these defaults, keepalives can still be beneficial for long-running
scheduler/agent instances. Also, operators might explicitly tune the values for their
clusters once we start supporting it.

  was:
We currently don't use TCP KeepAlive's when creating sockets in libprocess. This might benefit
master - scheduler, master - agent connections i.e. we can detect if any of them failed faster.

Currently, if the master process goes down. If for some reason the {{RST}} sequence did not
reach the scheduler, the scheduler can only come to know about the disconnection when it tries
to do a {{send}} itself. 

The default TCP keep alive values on Linux are a joke though:
{code}
. This means that the keepalive routines wait for two hours (7200 secs) before sending the
first keepalive probe, and then resend it every 75 seconds. If no ACK response is received
for nine consecutive times, the connection is marked as broken.
{code}

However, for long running instances of scheduler/agent this still can be beneficial. Also,
operators might start tuning the values for their clusters explicitly once we start supporting
it.


> Consider introducing TCP KeepAlive for Libprocess sockets.
> ----------------------------------------------------------
>
>                 Key: MESOS-5361
>                 URL: https://issues.apache.org/jira/browse/MESOS-5361
>             Project: Mesos
>          Issue Type: Improvement
>          Components: libprocess
>            Reporter: Anand Mazumdar
>              Labels: mesosphere
>
> We currently don't use TCP keepalives when creating sockets in libprocess. Enabling them
might benefit master - scheduler and master - agent connections, i.e., we could detect failures
on either side faster.
> Currently, if the master process goes down and for some reason the {{RST}} sequence does
not reach the scheduler, the scheduler only learns about the disconnection when it next
tries to do a {{send}} itself.
> The default TCP keepalive values on Linux are of little use in a real-world application:
> {code}
> This means that the keepalive routines wait for two hours (7200 secs) before sending
the first keepalive probe, and then resend it every 75 seconds. If no ACK response is received
for nine consecutive times, the connection is marked as broken.
> {code}
> However, even with these defaults, keepalives can still be beneficial for long-running
scheduler/agent instances. Also, operators might explicitly tune the values for their clusters
once we start supporting it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
