zookeeper-user mailing list archives

From Powell Molleti <pmoll...@vmware.com>
Subject Re: quorum connection manager shutdown takes long time
Date Tue, 01 Sep 2015 21:49:57 GMT
Apologies for not posting the link to the old thread. Here it is:


On 8/31/15, 2:34 PM, "Powell Molleti" <pmolleti@vmware.com> wrote:

>In reference to:
>Simply removing sock.setSoTimeout(0) from
>lzYICW65qMs-kxwcASfZGRMQKh_67Ot4EpzPW4k&e=  has the unintended
>consequence of shutting down both the RecvWorker and SendWorker threads
>in all cases. The current code seems designed to keep the socket
>alive (and the threads running) so that the channel can be reused to
>communicate again with a peer node that is still alive but needs to
>redo leader election.
>I could not reproduce any issue when the threads shut down after the
>timeout, since new threads are created for the next iteration of leader
>election. I would rather reuse the threads and the channel, though, so I
>propose the following approach.
>The alternative I suggest is to still remove setSoTimeout(0) from here:
>lzYICW65qMs-kxwcASfZGRMQKh_67Ot4EpzPW4k&e=   , to also enable SO_KEEPALIVE
>via setKeepAlive() on this socket, and to not treat a timeout as an error
>when it occurs here:
>jYwu8LPG_s1B6_rlPeoZFTNj8PrRET3yEAg6A&e=  but to treat it as an error when
>it happens here: 
>This means that users can tune the TCP keep-alive timeouts so that TCP
>socket failures propagate to user space sooner, and ZooKeeper also resets
>the socket if it detects that the other side is not responding when it
>knows it needs a response within some bounded time.
>Ideally there would be userspace pings on every socket channel
>between ZooKeeper nodes to detect dead channels quickly. One seems to
>exist for the sockets that do Follow/Lead after leader election is done,
>but not for this one? Such a feature could be added, with care taken to
>keep it backward compatible.
>I posted the above text to Jira. Please also point out any wrong
>assumptions I have made, and provide comments and suggestions.
>> From Raúl Gutiérrez Segalés <...@itevenworks.net>
>> Subject Re: quorum connection manager shutdown takes long time
>> Date Thu, 10 Jul 2014 18:02:37 GMT
>> On 9 July 2014 08:28, Michi Mutsuzaki <michi@cs.stanford.edu> wrote:
>>> I don't know how I missed that :) QA said this is reproducible, so
>>> I'll try commenting this line out. Thanks Flavio!
>> I am curious, was it that?
>> -rgs
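
The alternative proposed in the quoted message (a finite read timeout plus SO_KEEPALIVE, with a timeout tolerated while idle but treated as an error when a reply is expected) can be sketched roughly as below. This is not ZooKeeper's QuorumCnxManager code. The class name, the method names, the IDLE sentinel, the 100 ms timeout and the single-byte reads are all assumptions made for illustration. Reading one byte at a time sidesteps the question of a timeout landing in the middle of a framed message, which a real receive loop would have to handle.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical sketch; names and values are illustrative, not ZooKeeper's.
class KeepAliveRecvSketch {
    // Sentinel for "peer stayed silent"; distinct from read()'s -1 (end of stream).
    static final int IDLE = -2;

    // Use a finite read timeout plus SO_KEEPALIVE instead of setSoTimeout(0).
    static void configure(Socket sock, int readTimeoutMs) throws IOException {
        sock.setKeepAlive(true);
        sock.setSoTimeout(readTimeoutMs);
    }

    // Reads one byte. A timeout is rethrown only when a reply is expected;
    // otherwise the read is retried and the socket stays open.
    // maxIdleTimeouts exists only so that this sketch terminates.
    static int readByte(Socket sock, boolean expectingReply, int maxIdleTimeouts)
            throws IOException {
        InputStream in = sock.getInputStream();
        int idle = 0;
        while (true) {
            try {
                return in.read();
            } catch (SocketTimeoutException e) {
                if (expectingReply) {
                    throw e; // caller would reset the socket here
                }
                if (++idle >= maxIdleTimeouts) {
                    return IDLE; // idle channel: keep it for reuse
                }
            }
        }
    }

    // Exercises the three cases over a loopback connection.
    static String demo() {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket(InetAddress.getLoopbackAddress(),
                                        server.getLocalPort());
             Socket peer = server.accept()) {
            configure(client, 100);
            // Case 1: peer is silent and no reply is expected.
            int idle = readByte(client, false, 2);
            boolean open = !client.isClosed();
            // Case 2: peer sends data on the same, still-open channel.
            peer.getOutputStream().write(7);
            peer.getOutputStream().flush();
            int data = readByte(client, false, 2);
            // Case 3: a reply is expected but never arrives.
            boolean timedOut = false;
            try {
                readByte(client, true, 2);
            } catch (SocketTimeoutException e) {
                timedOut = true;
            }
            return "idle=" + (idle == IDLE) + " open=" + open
                    + " data=" + data + " timedOut=" + timedOut;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch, setKeepAlive(true) only switches the option on. How quickly a dead peer is noticed still depends on the operating system's TCP keep-alive settings, which is the tuning knob the quoted message refers to.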
