Date: Sat, 14 May 2016 16:14:12 +0000 (UTC)
From: "Anand Mazumdar (JIRA)"
To: issues@mesos.apache.org
Reply-To: dev@mesos.apache.org
Subject: [jira] [Updated] (MESOS-5361) Consider introducing TCP KeepAlive for Libprocess sockets.

[ https://issues.apache.org/jira/browse/MESOS-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anand Mazumdar updated MESOS-5361:
----------------------------------
    Description:
We currently do not use TCP keepalives when creating sockets in libprocess.
This might benefit master - scheduler and master - agent connections, i.e. we could detect faster when either side has failed.

Currently, if the master process goes down and for some reason the {{RST}} segment does not reach the scheduler, the scheduler only learns about the disconnection when it tries to do a {{send}} itself.

The default TCP keepalive values on Linux are of little use in a real-world application:
{code}
. This means that the keepalive routines wait for two hours (7200 secs) before sending the first keepalive probe, and then resend it every 75 seconds. If no ACK response is received for nine consecutive times, the connection is marked as broken.
{code}
However, for long-running instances of scheduler/agent this can still be beneficial. Also, operators might start tuning the values for their clusters explicitly once we start supporting it.

  was:
We currently don't use TCP KeepAlive's when creating sockets in libprocess. This might benefit master - scheduler, master - agent connections i.e. we can detect if any of them failed faster.

Currently, if the master process goes down. If for some reason the {{RST}} sequence did not reach the scheduler, the scheduler can only come to know about the disconnection when it tries to do a {{send}} itself.

The default TCP keep alive values on Linux are a joke though:
{code}
. This means that the keepalive routines wait for two hours (7200 secs) before sending the first keepalive probe, and then resend it every 75 seconds. If no ACK response is received for nine consecutive times, the connection is marked as broken.
{code}
However, for long running instances of scheduler/agent this still can be beneficial. Also, operators might start tuning the values for their clusters explicitly once we start supporting it.


> Consider introducing TCP KeepAlive for Libprocess sockets.
> ----------------------------------------------------------
>
>                 Key: MESOS-5361
>                 URL: https://issues.apache.org/jira/browse/MESOS-5361
>             Project: Mesos
>          Issue Type: Improvement
>          Components: libprocess
>            Reporter: Anand Mazumdar
>              Labels: mesosphere
>
> We currently do not use TCP keepalives when creating sockets in libprocess. This might benefit master - scheduler and master - agent connections, i.e. we could detect faster when either side has failed.
> Currently, if the master process goes down and for some reason the {{RST}} segment does not reach the scheduler, the scheduler only learns about the disconnection when it tries to do a {{send}} itself.
> The default TCP keepalive values on Linux are of little use in a real-world application:
> {code}
> . This means that the keepalive routines wait for two hours (7200 secs) before sending the first keepalive probe, and then resend it every 75 seconds. If no ACK response is received for nine consecutive times, the connection is marked as broken.
> {code}
> However, for long-running instances of scheduler/agent this can still be beneficial. Also, operators might start tuning the values for their clusters explicitly once we start supporting it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
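
For illustration only, here is a minimal Python sketch of the idea discussed in the issue; it is not libprocess code and not a proposed patch. It assumes a Linux-like platform where the socket options {{TCP_KEEPIDLE}}, {{TCP_KEEPINTVL}} and {{TCP_KEEPCNT}} are available. The helper names {{detection_time}} and {{enable_keepalive}} and the tuned values 60/10/5 are hypothetical, chosen only to show how much sooner a dead peer could be noticed compared with the defaults quoted above (7200 s idle, 75 s interval, 9 probes).

```python
import socket

# Linux defaults as quoted in the issue: idle 7200 s, interval 75 s, 9 probes.
LINUX_DEFAULTS = {"idle": 7200, "interval": 75, "count": 9}


def detection_time(idle, interval, count):
    """Approximate seconds of silence before a dead peer is marked broken."""
    return idle + interval * count


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Enable SO_KEEPALIVE and tune the timers where the platform allows it.

    The defaults 60/10/5 are illustrative assumptions, not Mesos settings.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket options use Linux names; skip any the platform lacks.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


if __name__ == "__main__":
    print(detection_time(**LINUX_DEFAULTS))  # 7875
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s)
        print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
```

With the default values, a silent peer is declared broken only after roughly 7875 seconds (about 2 hours 11 minutes); with the hypothetical 60/10/5 settings the same calculation gives 110 seconds, which is the kind of per-cluster tuning the issue anticipates from operators.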