Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 1D24A20049D for ; Wed, 9 Aug 2017 19:06:10 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 1B8BD1699CF; Wed, 9 Aug 2017 17:06:10 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 383CD1699D1 for ; Wed, 9 Aug 2017 19:06:09 +0200 (CEST) Received: (qmail 10428 invoked by uid 500); 9 Aug 2017 17:06:08 -0000 Mailing-List: contact dev-help@zookeeper.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@zookeeper.apache.org Delivered-To: mailing list dev@zookeeper.apache.org Received: (qmail 10390 invoked by uid 99); 9 Aug 2017 17:06:08 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 09 Aug 2017 17:06:08 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id DB87C1A051B for ; Wed, 9 Aug 2017 17:06:07 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -99.202 X-Spam-Level: X-Spam-Status: No, score=-99.202 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id IpC47TPmceoq for ; Wed, 9 Aug 2017 17:06:06 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTP id 4BCED60D08 for ; Wed, 9 Aug 2017 17:06:05 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 818B6E0E80 for ; Wed, 9 Aug 2017 17:06:03 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 9E632241B9 for ; Wed, 9 Aug 2017 17:06:00 +0000 (UTC) Date: Wed, 9 Aug 2017 17:06:00 +0000 (UTC) From: "ASF GitHub Bot (JIRA)" To: dev@zookeeper.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (ZOOKEEPER-2471) Java Zookeeper Client incorrectly considers time spent sleeping as time spent connecting, potentially resulting in infinite reconnect loop MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Wed, 09 Aug 2017 17:06:10 -0000 [ https://issues.apache.org/jira/browse/ZOOKEEPER-2471?page=3Dcom.atlas= sian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D= 16120259#comment-16120259 ]=20 ASF GitHub Bot commented on ZOOKEEPER-2471: ------------------------------------------- GitHub user DanBenediktson opened a pull request: https://github.com/apache/zookeeper/pull/330 ZOOKEEPER-2471: ZK Java client should not count sleep time as connect t= ime ClientCnxnSocket uses a member variable "now" to track the current time= , but does not update it at all potentially-blocking times: in particular, = it does not update it after the random sleep introduced if an initial conne= ct attempt fails. This results in the random sleep time being counted towar= ds connect time, resulting in incorrect application of connection timeout c= urrently, and if ZOOKEEPER-2869 is taken, a very real possibility (we have = seen it in production) of wedging the Zookeeper client so that it can never= successfully reconnect, because its sleep time may grow beyond its connect= ion timeout, especially in scenarios where there is a big gap between negot= iated session timeout and client-requested session timeout. =20 Rather than fixing the bug by adding another "updateNow()" call, keepin= g the brittle "updateNow()" implementation which led to the bug in the firs= t place, I have deleted updateNow() and replaced usage of that member varia= ble with actually getting the current system timestamp whenever the impleme= ntation needs to know the current time. =20 Regarding unit testing, this is, IMO, too difficult to test without int= roducing a lot of invasive changes to ClientCnxn.java, seeing as the only e= ffective change is that, on connection retry, the random sleep time is no l= onger counted towards a time budget. I can throw a lot of mocks at this, li= ke ClientReconnectTest, but I'm still going to be stuck depending on the be= havior of that randomly-generated sleep time, which is going to be inherent= ly unreliable. If a fix is taken for ZOOKEEPER-2869, this should become muc= h easier to test, since I will then be able to inject a different backoff s= leep behavior, and since I'm planning to submit a pull request for that tic= ket as well, so maybe as a compromise I can submit a test for this bug fix = at that time? You can merge this pull request into a Git repository by running: $ git pull https://github.com/DanBenediktson/zookeeper ZOOKEEPER-2471 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/zookeeper/pull/330.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #330 =20 ---- commit 60f38726e7f07b4bb970cc8fb089363ff48eb3df Author: Dan Benediktson Date: 2017-08-09T16:41:42Z ZOOKEEPER-2471: Zookeeper Java client should not count time spent sleep= ing as time spent connecting =20 Rather than keep the brittle "updateNow()" implementation which led to = the bug and fixing the bug by adding another "updateNow()" call, I have deleted updateNow() and repla= ced usage of that member variable with actually getting the current system timestamp. =20 This is, IMO, too difficult to test without introducing a lot of invasi= ve changes to ClientCnxn.java, seeing as the only effective change is that, on connection retry, a ran= dom sleep time is no longer counted towards a time budget. If a fix is taken for ZOOKEEPER-2869, th= is should become much easier to test, and since I'm planning to submit a pull request for that ticket a= s well, maybe as a compromise I can submit a test for this patch at that time? ---- > Java Zookeeper Client incorrectly considers time spent sleeping as time s= pent connecting, potentially resulting in infinite reconnect loop > -------------------------------------------------------------------------= ----------------------------------------------------------------- > > Key: ZOOKEEPER-2471 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2471 > Project: ZooKeeper > Issue Type: Bug > Components: java client > Affects Versions: 3.5.1 > Environment: all > Reporter: Dan Benediktson > Assignee: Dan Benediktson > Attachments: ZOOKEEPER-2471.patch > > > ClientCnxnSocket uses a member variable "now" to track the current time, = and lastSend / lastHeard variables to track socket liveness. Implementation= s, and even ClientCnxn itself, are expected to call both updateNow() to res= et "now" to System.currentTimeMillis, and then call updateLastSend()/update= LastHeard() on IO completions. > This is a fragile contract, so it's not surprising that there's a bug res= ulting from it: ClientCnxn.SendThread.run() calls updateLastSendAndHeard() = as soon as startConnect() returns, but it does not call updateNow() first. = I expect when this was written, either the expectation was that startConnec= t() was an asynchronous operation and that updateNow() would have been call= ed very recently, or simply the requirement to call updateNow() was forgott= en at this point. As far as I can see, this bug has been present since the = "updateNow" method was first introduced in the distant past. As it turns ou= t, since startConnect() calls HostProvider.next(), which can sleep, quite a= lot of time can pass, leaving a big gap between "now" and now. > If you are using very short session timeouts (one of our ZK ensembles has= many clients using a 1-second timeout), this is potentially disastrous, be= cause the sleep time may exceed the connection timeout itself, which can po= tentially result in the Java client being stuck in a perpetual reconnect lo= op. The exact code path it goes through in this case is complicated, becaus= e there has to be a previously-closed socket still waiting in the selector = (otherwise, the first timeout evaluation will not fail because "now" still = hasn't been updated, and then the actual connect timeout will be applied in= ClientCnxnSocket.doTransport()) so that select() will harvest the IO from = the previous socket and updateNow(), resulting in the next loop through Cli= entCnxnSocket.SendThread.run() observing the spurious timeout and failing. = In practice it does happen to us fairly frequently; we only got to the bott= om of the bug yesterday. Worse, when it does happen, the Zookeeper client o= bject is rendered unusable: it's stuck in a perpetual reconnect loop where = it keeps sleeping, opening a socket, and immediately closing it. > I have a patch. Rather than calling updateNow() right after startConnect(= ), my fix is to remove the "now" member variable and the updateNow() method= entirely, and to instead just call System.currentTimeMillis() whenever tim= e needs to be evaluated. I realize there is a benefit (aside from a trivial= micro-optimization not worth worrying about) to having the time be "fixed"= , particularly for truth in the logging: if time is fixed by an updateNow()= call, then the log for a timeout will still show exactly the same value th= e code reasoned about. However, this benefit is in my opinion not enough to= merit the fragility of the contract which led to this (for us) highly impa= ctful and difficult-to-find bug in the first place. > I'm currently running ant tests locally against my patch on trunk, and th= en I'll upload it here. -- This message was sent by Atlassian JIRA (v6.4.14#64029)