Mailing-List: contact user-help@zookeeper.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@zookeeper.apache.org
Received-SPF: pass (athena.apache.org: domain of barlock@us.ibm.com designates
 32.97.110.150 as permitted sender)
To: user@zookeeper.apache.org
MIME-Version: 1.0
Subject: ZooKeeper TCP Port Connection Problem
From: Chris Barlock <barlock@us.ibm.com>
Message-ID: 
 <OF03EDF594.CAC905AB-ON85257DD5.005D9EE9-85257DD5.00630EF3@us.ibm.com>
Date: Thu, 22 Jan 2015 13:01:55 -0500
Content-Type: multipart/alternative;
 boundary="=_alternative 00630E8085257DD5_="

--=_alternative 00630E8085257DD5_=
Content-Type: text/plain; charset="US-ASCII"

With my implementation of a ZK client, netstat shows around 2000 open 
socket connections to ZK nearly all the time.  Many of them are in the 
TIME_WAIT state and will go away, but enough new ones get created to keep 
the count fairly steady.  Eventually ZK gets into a state in which I can't 
even connect with zkCli.  On the web, I read that one should always be 
prepared to retry ZK API calls because they can fail for any number of 
reasons.  I implemented wrapper methods for each of the ZK calls I make 
that retry the operation once, and this did eliminate the random 
ConnectionLoss KeeperExceptions I was seeing.  I also implemented this 
method, which is called before every ZK operation to check whether I have 
a valid ZK connection:

    private void connectZooKeeper() {
        final String methodName = "connectZooKeeper";
 
        if (zk == null || zk.getState() != States.CONNECTED) {
            if (zk != null) {
                close();
            }
            try {
                zk = new ZooKeeper(connectString, sessionTimeout, this);
                int connectAttempts = 0;
                while (zk.getState() != States.CONNECTED
                        && connectAttempts < MAX_ZK_CONNECT_ATTEMPTS) {
                    try {
                        Thread.sleep(ZK_CONNECT_WAIT);
                    } catch (InterruptedException e) {
                        // Ignore
                    }
                    connectAttempts++;
                }
            } catch (IOException e) {
                trace.exception(CLASS_NAME, methodName, e);
            }
            // zk can still be null here if the constructor threw on a first connect
            if (zk == null || zk.getState() != States.CONNECTED) {
                trace.textError(CLASS_NAME, methodName,
                        "Unable to connect to ZooKeeper!");
            }
        }
    }

Here, close() simply calls ZooKeeper.close().  sessionTimeout is five 
seconds.  MAX_ZK_CONNECT_ATTEMPTS is 40 and ZK_CONNECT_WAIT is 50 ms, for 
a maximum wait of two seconds (which I think is too short, as I have seen 
cases in which I traced the "Unable to connect to ZooKeeper!" message).
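
To make the arithmetic concrete, here is a stripped-down sketch of the 
same bounded polling, using only the JDK.  BoundedWait and its method 
names are hypothetical (not my production code), and the condition is 
passed in as a BooleanSupplier standing in for the zk.getState() check. 
Unlike the method above, this sketch stops waiting when interrupted 
instead of ignoring the interrupt:

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration; not part of the ZooKeeper API.
final class BoundedWait {
    static final int MAX_ZK_CONNECT_ATTEMPTS = 40;
    static final long ZK_CONNECT_WAIT = 50L; // milliseconds

    // Worst-case time spent sleeping, in milliseconds.
    static long maxWaitMillis(int attempts, long waitMillis) {
        return attempts * waitMillis;
    }

    // Polls the condition, sleeping waitMillis between checks, for at most
    // `attempts` sleeps. Returns whether the condition holds at the end.
    static boolean awaitCondition(BooleanSupplier connected,
                                  int attempts, long waitMillis) {
        int tries = 0;
        while (!connected.getAsBoolean() && tries < attempts) {
            try {
                Thread.sleep(waitMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting early.
                Thread.currentThread().interrupt();
                return connected.getAsBoolean();
            }
            tries++;
        }
        return connected.getAsBoolean();
    }
}
```

With the values above, maxWaitMillis(40, 50) is 2000, which is the 
two-second ceiling I mentioned.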

Am I doing something wrong here that could be causing the excessively 
large number of TCP connections?  It would seem that getState() returns 
something other than CONNECTED far more frequently than I expect, though 
I have not yet traced this to confirm.  (On my to-do list.)
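
For that tracing, I have in mind something like the following tally, 
sketched with the JDK only.  StateTally is a hypothetical class, and its 
State enum is a stand-in that lists only a subset of the values of 
ZooKeeper.States; the idea is to call record() with the observed state 
each time connectZooKeeper() runs and then compare the counts:

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical tracing helper; not part of the ZooKeeper API.
final class StateTally {
    // Stand-in for ZooKeeper.States (subset of the real values).
    enum State { CONNECTING, CONNECTED, CLOSED }

    private final Map<State, Long> counts = new EnumMap<>(State.class);

    // Record one observation of the given state.
    synchronized void record(State s) {
        counts.merge(s, 1L, Long::sum);
    }

    // Number of times the given state was observed.
    synchronized long count(State s) {
        return counts.getOrDefault(s, 0L);
    }

    // Number of observations of any state.
    synchronized long total() {
        long sum = 0;
        for (long c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    // Number of observations in which the state was not CONNECTED,
    // i.e. how often the reconnect path would have been taken.
    synchronized long notConnected() {
        return total() - count(State.CONNECTED);
    }
}
```

If notConnected() turns out to be a large fraction of total(), that would 
confirm that a new ZooKeeper object is being created far more often than 
I expect.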

We are using ZK 3.3.4, which is what ships with the version of Kafka we 
are using and is obviously not current.  Would stepping up to the current 
ZK version fix this problem?

Thanks!

Chris
--=_alternative 00630E8085257DD5_=--