From: Chris Barlock
To: user@zookeeper.apache.org
Subject: ZooKeeper TCP Port Connection Problem
Date: Thu, 22 Jan 2015 13:01:55 -0500

With my implementation of a ZK client, I see that, just about all the time, there are around 2000 open socket connections to ZK according to netstat! Many of them are in the TIME_WAIT state and will go away, but enough new ones get created to keep the count fairly steady. Eventually ZK gets into a state in which I can't even connect with zkCli.

On the web, I read that one should always be prepared to retry ZK API calls because they can fail for any number of reasons. I implemented methods for each of the ZK calls I make that retry the operation once, and this did eliminate the random ConnectionLoss KeeperExceptions I was seeing. I also implemented this method, which is called before every ZK operation to check that I have a valid ZK connection:

    private void connectZooKeeper() {
        final String methodName = "connectZooKeeper";
        if (zk == null || zk.getState() != States.CONNECTED) {
            if (zk != null) {
                close();
            }
            try {
                zk = new ZooKeeper(connectString, sessionTimeout, this);
                int connectAttempts = 0;
                while (zk.getState() != States.CONNECTED
                        && connectAttempts < MAX_ZK_CONNECT_ATTEMPTS) {
                    try {
                        Thread.sleep(ZK_CONNECT_WAIT);
                    } catch (InterruptedException e) {
                        // Ignore
                    }
                    connectAttempts++;
                }
            } catch (IOException e) {
                trace.exception(CLASS_NAME, methodName, e);
            }
            if (zk.getState() != States.CONNECTED) {
                trace.textError(CLASS_NAME, methodName, "Unable to connect to ZooKeeper!");
            }
        }
    }

Here, close() simply calls ZooKeeper.close(). sessionTimeout is five seconds. MAX_ZK_CONNECT_ATTEMPTS is 40 and ZK_CONNECT_WAIT is 50 ms, for a maximum wait of two seconds (which I think is too short, as I have seen cases in which I traced the "Unable to connect to ZooKeeper!" message).

Am I doing something poorly here that could be causing the excessively large number of TCP connections? It would seem that getState() is not CONNECTED far more frequently than I expect, though I have not yet traced this to confirm. (It is on my to-do list.)

We are using ZK 3.3.4, which is what ships with the version of Kafka we are using. Obviously, that is not current. Would stepping up to the current ZK version fix this problem?

Thanks!

Chris
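
For illustration, a retry-once wrapper of the kind described above might look roughly like the sketch below. It is not the code from the client in question: the names RetryOnce, callWithRetry, and TransientException are hypothetical, and a plain Callable stands in for a real ZK call so that the example runs without the ZooKeeper library.

```java
import java.util.concurrent.Callable;

public class RetryOnce {
    /** Hypothetical stand-in for a transient failure such as a connection loss. */
    public static class TransientException extends Exception {
        public TransientException(String message) {
            super(message);
        }
    }

    /** Runs the operation and, if it fails transiently, retries it exactly once. */
    public static <T> T callWithRetry(Callable<T> operation) throws Exception {
        try {
            return operation.call();
        } catch (TransientException first) {
            // Second and final attempt; a second failure propagates to the caller.
            return operation.call();
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] attempts = {0};
        // Simulated operation (assumption): fails on the first call, succeeds on the second.
        String result = callWithRetry(() -> {
            attempts[0]++;
            if (attempts[0] == 1) {
                throw new TransientException("simulated connection loss");
            }
            return "ok";
        });
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

In a real client the catch clause would presumably target KeeperException.ConnectionLossException rather than the placeholder exception used here. Note that this wrapper only re-runs the operation; it does not create a new connection.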