Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id B72CA200C73 for ; Wed, 26 Apr 2017 07:43:10 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id B5CB2160BB6; Wed, 26 Apr 2017 05:43:10 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 00B71160BB3 for ; Wed, 26 Apr 2017 07:43:09 +0200 (CEST) Received: (qmail 83125 invoked by uid 500); 26 Apr 2017 05:43:09 -0000 Mailing-List: contact dev-help@zookeeper.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@zookeeper.apache.org Delivered-To: mailing list dev@zookeeper.apache.org Received: (qmail 83112 invoked by uid 99); 26 Apr 2017 05:43:08 -0000 Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 26 Apr 2017 05:43:08 +0000 Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id 95BA5DFBB7; Wed, 26 Apr 2017 05:43:08 +0000 (UTC) From: hanm To: dev@zookeeper.apache.org Reply-To: dev@zookeeper.apache.org References: In-Reply-To: Subject: [GitHub] zookeeper issue #229: ZOOKEEPER-2759: Flaky test: org.apache.zookeeper.serve... Content-Type: text/plain Message-Id: <20170426054308.95BA5DFBB7@git1-us-west.apache.org> Date: Wed, 26 Apr 2017 05:43:08 +0000 (UTC) archived-at: Wed, 26 Apr 2017 05:43:10 -0000 Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/229 Thanks for work on this! I hope we have more contributors focus on improving the test. The proposed fix masks the real problem - it will always success no matter what - even if the connection was succeeded, since the halt will clean up the sender worker map of the peer 0. >> a race condition between the removal of the relevant sid from senderWorkerMap for peer0 and the 3 second delay from assertEventuallyNotConnected. I think there is a race, but my observation on it is slightly different. The race is between when peer 0 received (and finished) establish connection with peer 1, and the assert inside assertEventuallyNotConnected. >> may not destroy them in time for the test assertion. Because of the race, once peer 0 has received the connection, its sender worker map will has sid 1 entry forever until the death of the peer 0 itself. I think instead of checking on peer 0, we can check peer 1. Its sender worker map should always be clean. We can't check peer 0 here because the current API does not encode the true state of the connection - it only encodes what happened (the connection from 1 to 0 was established on 0's side), but it does not encode the status of the connection (is the connection still alive or not - in this case peer 1 will close the socket later when it found out peer 0 has no auth. but on peer 0 side the sender work will have sid 1 inside until its death, what's done has been done.). The state of peer 1 wrt its sender map should also reflect state of peer 0 since there is only a single connection between 0 and 1. Something like this should serve the purpose: assertEventuallyNotConnected(peer1, 0); Maybe add a sleep before it to make sure all the connection attempts were made (from 0 to 1 and from 1 to 0 two ways), but it's a best effort. 5 sec sounds enough. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastructure@apache.org or file a JIRA ticket with INFRA. ---