From: Powell Molleti
To: "user@zookeeper.apache.org"
Subject: Re: quorum connection manager shutdown takes long time
Date: Mon, 31 Aug 2015 21:34:38 +0000

In reference to: https://issues.apache.org/jira/browse/ZOOKEEPER-2246

Simply removing sock.setSoTimeout(0) from http://s.apache.org/TfI has the unintended consequence of shutting down both the RecvWorker and SendWorker threads in all cases. It seems the current code is designed to keep the socket alive (and the threads running) so that the channel can be reused to communicate with a peer node that is still alive but needs to redo leader election.

I could not reproduce any issue when the threads shut down after the timeout, since new threads are created for the next iteration of leader election.
I would rather reuse the threads and the channel, hence I propose the following approach.

The alternative I suggest is to still remove setSoTimeout(0) from here: http://s.apache.org/TfI, also enable SO_KEEPALIVE via setKeepAlive() on this socket, and not consider it an error when a timeout occurs here: http://bit.ly/1JHIdVY but consider it an error when it happens here: http://bit.ly/1NTjQ9R

This means that users can tune the TCP keep-alive timeouts to make socket failures propagate to user space more quickly, and ZooKeeper also resets the socket if it detects that the other side is not responding when it knows it needs a response within some bounded time.

Ideally there would be user-space pings on every socket channel between ZooKeeper nodes to detect dead channels quickly. It seems one exists for the sockets used for Follow/Lead after leader election is done, but not for this one? Such a feature could be added with care to keep it backward compatible.

I posted the above text to Jira. Please also point out any wrong assumptions I have made and provide comments and suggestions.

Thanks
Powell.

> From Raúl Gutiérrez Segalés <...@itevenworks.net>
> Subject Re: quorum connection manager shutdown takes long time
> Date Thu, 10 Jul 2014 18:02:37 GMT
> On 9 July 2014 08:28, Michi Mutsuzaki wrote:
>> I don't know how I missed that :) QA said this is reproducible, so
>> I'll try commenting this line out. Thanks Flavio!
>>
> I am curious, was it that?
> -rgs
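
For illustration, the proposal in the message above (keep a finite read timeout, enable SO_KEEPALIVE, and treat a timeout as an error only when a response is expected) might look roughly like the following. This is a minimal sketch, not ZooKeeper code: the class KeepAliveReadSketch, the methods configure() and readByte(), and the IDLE_TIMEOUT sentinel are hypothetical names made up for this example. Only setSoTimeout(), setKeepAlive() and SocketTimeoutException are standard java.net API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Illustrative sketch only: the class, method, and constant names are
// hypothetical and are not taken from ZooKeeper's QuorumCnxManager.
class KeepAliveReadSketch {

    /** Sentinel meaning "the read timed out while idle; keep the channel". */
    static final int IDLE_TIMEOUT = -2;

    /** Use a finite read timeout (rather than 0, which blocks forever) plus SO_KEEPALIVE. */
    static void configure(Socket sock, int readTimeoutMs) throws IOException {
        sock.setSoTimeout(readTimeoutMs); // reads throw SocketTimeoutException after this long
        sock.setKeepAlive(true);          // let the TCP stack probe a silent peer
    }

    /**
     * Reads one byte from the channel.
     * Returns the byte (0..255), -1 at end of stream, or IDLE_TIMEOUT when the
     * read timed out and no response was expected. When a response was
     * expected, the timeout is rethrown so the caller can reset the socket.
     */
    static int readByte(InputStream in, boolean responseExpected) throws IOException {
        try {
            return in.read();
        } catch (SocketTimeoutException e) {
            if (responseExpected) {
                throw e; // bounded wait exceeded: treat as a dead channel
            }
            return IDLE_TIMEOUT; // idle channel: not an error, threads keep running
        }
    }
}
```

A receive loop would call readByte(sock.getInputStream(), false) while idle and simply loop again on IDLE_TIMEOUT, and would pass true while waiting for a reply it knows must arrive, closing the socket if the exception propagates. How quickly keep-alive probes detect a dead peer is governed by operating-system settings, not by this code.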