zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ian Rose <ianr...@fullstory.com>
Subject Re: Heartbeats not being received / responded to?
Date Wed, 21 Jan 2015 19:00:01 GMT
Hi all -

Reviving an old thread here.  We have now been running Zookeeper for a few
months now and this problem has continued to dog us.  This is entirely a
localhost problem (when zookeeper and the client app are both running on a
developer laptop) - thankfully everything has been solid on our production
servers.  Nonetheless this causes major headaches in our development
process.  For example, I just now debugged a problem our UX lead was having
getting our dev stack up and running on his local machine; the problem was
that our startup script was failing when using zkCli.sh to create a "/solr"
node in ZK.  This would not be a problem if it was a rare/spurious error,
but he is getting this error every single time.

For reference, here is the complete stack trace (which appears after
several seconds - the failure is not immediate):

Exception in thread "main"
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /solr
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at
org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:695)
    at org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:588)
    at
org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:360)
    at org.apache.zookeeper.ZooKeeperMain.run(ZooKeeperMain.java:323)
    at org.apache.zookeeper.ZooKeeperMain.main(ZooKeeperMain.java:282)

Frustratingly, this behavior is very inconsistent.  In fact, after 5+
failures in a row on our UX lead's machine, now I just heard from him that
he tried again and it worked.  Could it be that there is some kind of state
on Zookeeper that needs to timeout and disappear before things will work?

More concretely, the next time this happens, what kind of info should I
capture to make this more debuggable?

Many thanks,
Ian



On Mon, Sep 29, 2014 at 4:51 PM, Ian Rose <ianrose@fullstory.com> wrote:

> Well unfortunately although I am able to repro this problem with a
> completely new, clean ZK install on my laptop, I *cannot* repro with the
> same on my coworkers laptop.  So unfortunately I am forced to conclude that
> there is something strange going on locally.
>
> Thanks for the help anyhow!
>
> - Ian
>
>
> On Mon, Sep 29, 2014 at 9:40 AM, Camille Fournier <camille@apache.org>
> wrote:
>
>> No that is not expected. Odd that you get disconnected once and then
>> reconnect fine. Does the same thing happen in your kazoo clients, one
>> disconnect but then the second connect is ok?
>> Which version of ZK are you running? Are you running this with some sort
>> of
>> auth or password to the zk server?
>>
>> Thanks,
>> C
>>
>> On Mon, Sep 29, 2014 at 9:24 AM, Ian Rose <ianrose@fullstory.com> wrote:
>>
>> > Perhaps I'm simply misunderstanding what the expected behavior would be.
>> > Why would my client by disconnected?  Does zookeeper drop idle clients?
>> > And note that this isn't a spurious disconnect; my client is *always*
>> > dropped
>> > at that time.
>> >
>> > On Friday zkCli seemed to be working just fine for me, but now I am
>> getting
>> > disconnections similar to my kazoo-based client.
>> >
>> > Here is a session showing `ls /' failing.  This behavior is reproducible
>> > for me currently.
>> >
>> > $ ./bin/zkCli.sh
>> > > Connecting to localhost:2181
>> > > Welcome to ZooKeeper!
>> > > JLine support is enabled
>> > > [zk: localhost:2181(CONNECTING) 0]
>> > > WATCHER::
>> > > WatchedEvent state:SyncConnected type:None path:null
>> > > [zk: localhost:2181(CONNECTED) 0]
>> > > [zk: localhost:2181(CONNECTED) 0] ls /
>> > > WATCHER::
>> > > WatchedEvent state:Disconnected type:None path:null
>> > > Exception in thread "main"
>> > > org.apache.zookeeper.KeeperException$ConnectionLossException:
>> > > KeeperErrorCode = ConnectionLoss for /
>> > > at
>> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>> > > at
>> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>> > > at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>> > > at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>> > > at
>> > org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:720)
>> > > at
>> org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:588)
>> > > at
>> org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:360)
>> > > at org.apache.zookeeper.ZooKeeperMain.run(ZooKeeperMain.java:323)
>> > > at org.apache.zookeeper.ZooKeeperMain.main(ZooKeeperMain.java:282)
>> >
>> >
>> > And here is a session where after connecting I just sit and wait for a
>> bit
>> > (without sending any commands), whereupon I get disconnected and then
>> > immediately reconnected.  Interestingly, after the reconnecting I am
>> able
>> > to issue multiple commands without any issues.  This behavior is
>> > reproducible for me currently.
>> >
>> > $ ./bin/zkCli.sh
>> > > Connecting to localhost:2181
>> > > Welcome to ZooKeeper!
>> > > JLine support is enabled
>> > > [zk: localhost:2181(CONNECTING) 0]
>> > > [zk: localhost:2181(CONNECTING) 0]
>> > > WATCHER::
>> > > WatchedEvent state:SyncConnected type:None path:null
>> > > [zk: localhost:2181(CONNECTED) 0]
>> > > WATCHER::
>> > > WatchedEvent state:Disconnected type:None path:null
>> > > WATCHER::
>> > > WatchedEvent state:SyncConnected type:None path:null
>> > > [zk: localhost:2181(CONNECTED) 0]
>> > > [zk: localhost:2181(CONNECTED) 0] ls /
>> > > [zookeeper]
>> > > [zk: localhost:2181(CONNECTED) 1]
>> > > [zk: localhost:2181(CONNECTED) 1]
>> > > [zk: localhost:2181(CONNECTED) 1] ls /zookeeper
>> > > [quota]
>> > > [zk: localhost:2181(CONNECTED) 2] ls /zookeeper/quota
>> > > []
>> >
>> >
>> >
>> > Thanks,
>> > - Ian
>> >
>> >
>> > On Sun, Sep 28, 2014 at 2:22 PM, Camille Fournier <camille@apache.org>
>> > wrote:
>> >
>> > > Sorry but with what you've sent us I don't really see what the problem
>> > is.
>> > > It does look like you connect and then nothing happens for 20s and
>> then
>> > the
>> > > connection is dropped. If you use the zkCli script to connect via the
>> > > command line do you see the same problem?
>> > >
>> > > C
>> > >
>> > > On Fri, Sep 26, 2014 at 12:48 PM, Ian Rose <ianrose@fullstory.com>
>> > wrote:
>> > >
>> > > > Hi all -
>> > > >
>> > > > I've just gotten started using SolrCloud, which uses Zookeeper for
>> > > > coordination; I am otherwise completely new to Zookeeper.  Now I am
>> > > trying
>> > > > to query Zookeeper directly for some simple information.  I am
>> finding,
>> > > > however, that although my clients are able to connect they very
>> > > frequently
>> > > > receive timeouts.  It almost seems like the server isn't receiving
>> the
>> > > > heartbeat messages at all (or isn't responding to them).  I've seen
>> > > similar
>> > > > behavior both when using the go-zookeeper
>> > > > <https://godoc.org/github.com/samuel/go-zookeeper/zk> and kazoo
>> > > > <https://github.com/python-zk/kazoo> (python) client libraries
(I
>> > wanted
>> > > > to
>> > > > try >1 to ensure that it wasn't a client lib problem).
>> > > >
>> > > > My config is very simple: I am running the client and a single
>> > Zookeeper
>> > > > node on the same machine (my laptop).  There are no other clients
of
>> > the
>> > > > Zookeeper node while I am running these tests, so there is no
>> practical
>> > > > possibility that the JVM is overloaded or GCing.
>> > > >
>> > > > Here is a very basic kazoo client that I am using.  Obviously it
>> isn't
>> > > > doing any "real" work right now - this is just to demonstrate the
>> > > > disconnects.
>> > > >
>> > > > #!/usr/bin/env python2.7
>> > > >
>> > > > from kazoo.client import KazooClient, KazooState
>> > > > import logging
>> > > > import time
>> > > >
>> > > > def my_listener(state):
>> > > >   if state == KazooState.LOST:
>> > > >     # Register somewhere that the session was lost
>> > > >     logging.warning('handle lost')
>> > > >   elif state == KazooState.SUSPENDED:
>> > > >     # Handle being disconnected from Zookeeper
>> > > >     logging.debug('handle being disconnected')
>> > > >   else:
>> > > >     # Handle being connected/reconnected to Zookeeper
>> > > >     logging.debug('handle being (re)connected')
>> > > >
>> > > >
>> > > > if __name__ == '__main__':
>> > > >   logging.basicConfig(format='%(asctime)-15s %(levelname)s
>> > %(message)s',
>> > > > level=logging.DEBUG)
>> > > >
>> > > >   logging.debug('starting...')
>> > > >   zk = KazooClient(hosts='127.0.0.1:2181', timeout=30)
>> > > >   zk.start()
>> > > >   zk.add_listener(my_listener)
>> > > >
>> > > >   time.sleep(35)
>> > > >
>> > > >   zk.stop()
>> > > >
>> > > > --------------------
>> > > >
>> > > > Here is my zookeeper config:
>> > > >
>> > > > # The number of milliseconds of each tick
>> > > > tickTime=2000
>> > > >
>> > > > # The number of ticks that the initial
>> > > > # synchronization phase can take
>> > > > initLimit=10
>> > > >
>> > > > # The number of ticks that can pass between
>> > > > # sending a request and getting an acknowledgement
>> > > > syncLimit=5
>> > > >
>> > > > # the directory where the snapshot is stored.
>> > > > dataDir=/Users/ianrose/Code/zookeeper/var/data
>> > > >
>> > > > # the port at which the clients will connect
>> > > > clientPort=2181
>> > > >
>> > > > # The number of snapshots to retain in dataDir
>> > > > autopurge.snapRetainCount=5
>> > > >
>> > > > # Purge task interval in hours
>> > > > # Set to "0" to disable auto purge feature
>> > > > autopurge.purgeInterval=1
>> > > >
>> > > >
>> > > > --------------------
>> > > >
>> > > >
>> > > > Here is the output I get on the client:
>> > > >
>> > > > 2014-09-26 12:43:20,603 DEBUG starting...
>> > > > 2014-09-26 12:43:20,604 INFO Connecting to 127.0.0.1:2181
>> > > > 2014-09-26 12:43:20,605 DEBUG Sending request(xid=None):
>> > > > Connect(protocol_version=0, last_zxid_seen=0, time_out=30000,
>> > > session_id=0,
>> > > >
>> > >
>> >
>> passwd='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
>> > > > read_only=None)
>> > > > 2014-09-26 12:43:20,609 INFO Zookeeper connection established,
>> state:
>> > > > CONNECTED
>> > > > 2014-09-26 12:43:40,222 WARNING Connection dropped: outstanding
>> > heartbeat
>> > > > ping not received
>> > > > 2014-09-26 12:43:40,222 WARNING Transition to CONNECTING
>> > > > 2014-09-26 12:43:40,222 INFO Zookeeper connection lost
>> > > > 2014-09-26 12:43:40,222 DEBUG handle being disconnected
>> > > > 2014-09-26 12:43:40,747 INFO Connecting to 127.0.0.1:2181
>> > > > 2014-09-26 12:43:40,748 DEBUG Sending request(xid=None):
>> > > > Connect(protocol_version=0, last_zxid_seen=0, time_out=30000,
>> > > > session_id=92520388231233540,
>> > > > passwd='\xd1M\xb9\xb3\xae\xab\xa1!@x\x06nv\xb7\xe3*',
>> read_only=None)
>> > > > 2014-09-26 12:43:40,750 INFO Zookeeper connection established,
>> state:
>> > > > CONNECTED
>> > > > 2014-09-26 12:43:40,751 DEBUG handle being (re)connected
>> > > > 2014-09-26 12:43:55,611 DEBUG Sending request(xid=1): Close()
>> > > > 2014-09-26 12:43:55,614 INFO Closing connection to 127.0.0.1:2181
>> > > > 2014-09-26 12:43:55,615 INFO Zookeeper session lost, state: CLOSED
>> > > > 2014-09-26 12:43:55,615 WARNING handle lost
>> > > >
>> > > >
>> > > > And here is trace-level logging from the server side:
>> > > >
>> > > > 2014-09-26 12:43:20,605 [myid:] - INFO  [NIOServerCxn.Factory:
>> > > > 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197][
>> <http://0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197%5D%5B>] - Accepted
>> socket
>> > > > connection from /127.0.0.1:58959
>> > > > 2014-09-26 12:43:20,605 [myid:] - DEBUG [NIOServerCxn.Factory:
>> > > > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@810][
>> <http://0.0.0.0/0.0.0.0:2181:ZooKeeperServer@810%5D%5B>] - Session
>> establishment
>> > > > request
>> > > > from client /127.0.0.1:58959 client's lastZxid is 0x0
>> > > > 2014-09-26 12:43:20,605 [myid:] - INFO  [NIOServerCxn.Factory:
>> > > > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868][
>> <http://0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868%5D%5B>] - Client
>> attempting to
>> > > > establish new session at /127.0.0.1:58959
>> > > > 2014-09-26 12:43:20,605 [myid:] - TRACE [NIOServerCxn.Factory:
>> > > > 0.0.0.0/0.0.0.0:2181:ZooTrace@71][
>> <http://0.0.0.0/0.0.0.0:2181:ZooTrace@71%5D%5B>] - SessionTrackerImpl
>> --- Adding
>> > > > session
>> > > > 0x148b2cd8b010004 30000
>> > > > 2014-09-26 12:43:20,606 [myid:] - TRACE [ProcessThread(sid:0
>> > > > cport:-1)::ZooTrace@90][] - :Psessionid:0x148b2cd8b010004
>> > > > type:createSession cxid:0x0 zxid:0xfffffffffffffffe txntype:unknown
>> > > > reqpath:n/a
>> > > > 2014-09-26 12:43:20,606 [myid:] - TRACE [ProcessThread(sid:0
>> > > > cport:-1)::ZooTrace@71][] - SessionTrackerImpl --- Existing session
>> > > > 0x148b2cd8b010004 30000
>> > > > 2014-09-26 12:43:20,608 [myid:] - DEBUG
>> > > > [SyncThread:0:FinalRequestProcessor@88][] - Processing request::
>> > > > sessionid:0x148b2cd8b010004 type:createSession cxid:0x0 zxid:0x78f
>> > > > txntype:-10 reqpath:n/a
>> > > > 2014-09-26 12:43:20,608 [myid:] - TRACE [SyncThread:0:ZooTrace@90][]
>> -
>> > > > :Esessionid:0x148b2cd8b010004 type:createSession cxid:0x0 zxid:0x78f
>> > > > txntype:-10 reqpath:n/a
>> > > > 2014-09-26 12:43:20,608 [myid:] - TRACE [SyncThread:0:ZooTrace@71][]
>> -
>> > > > SessionTrackerImpl --- Existing session 0x148b2cd8b010004 30000
>> > > > 2014-09-26 12:43:20,608 [myid:] - DEBUG
>> > > > [SyncThread:0:FinalRequestProcessor@160][] -
>> > sessionid:0x148b2cd8b010004
>> > > > type:createSession cxid:0x0 zxid:0x78f txntype:-10 reqpath:n/a
>> > > > 2014-09-26 12:43:20,609 [myid:] - INFO
>> > [SyncThread:0:ZooKeeperServer@617
>> > > > ][]
>> > > > - Established session 0x148b2cd8b010004 with negotiated timeout
>> 30000
>> > for
>> > > > client /127.0.0.1:58959
>> > > > 2014-09-26 12:43:40,748 [myid:] - INFO  [NIOServerCxn.Factory:
>> > > > 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197][
>> <http://0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197%5D%5B>] - Accepted
>> socket
>> > > > connection from /127.0.0.1:58960
>> > > > 2014-09-26 12:43:40,749 [myid:] - DEBUG [NIOServerCxn.Factory:
>> > > > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@810][
>> <http://0.0.0.0/0.0.0.0:2181:ZooKeeperServer@810%5D%5B>] - Session
>> establishment
>> > > > request
>> > > > from client /127.0.0.1:58960 client's lastZxid is 0x0
>> > > > 2014-09-26 12:43:40,749 [myid:] - INFO  [NIOServerCxn.Factory:
>> > > > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861][
>> <http://0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861%5D%5B>] - Client
>> attempting to
>> > renew
>> > > > session 0x148b2cd8b010004 at /127.0.0.1:58960
>> > > > 2014-09-26 12:43:40,749 [myid:] - INFO  [NIOServerCxn.Factory:
>> > > > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007][
>> <http://0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007%5D%5B>] - Closed socket
>> connection
>> > > for
>> > > > client /127.0.0.1:58959 which had sessionid 0x148b2cd8b010004
>> > > > 2014-09-26 12:43:40,750 [myid:] - TRACE [NIOServerCxn.Factory:
>> > > > 0.0.0.0/0.0.0.0:2181:ZooTrace@71][
>> <http://0.0.0.0/0.0.0.0:2181:ZooTrace@71%5D%5B>] - Session
>> 0x148b2cd8b010004 is
>> > > valid:
>> > > > true
>> > > > 2014-09-26 12:43:40,750 [myid:] - INFO  [NIOServerCxn.Factory:
>> > > > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@617][
>> <http://0.0.0.0/0.0.0.0:2181:ZooKeeperServer@617%5D%5B>] - Established
>> session
>> > > > 0x148b2cd8b010004 with negotiated timeout 30000 for client /
>> > > > 127.0.0.1:58960
>> > > > 2014-09-26 12:43:50,742 [myid:] - DEBUG
>> > > > [SyncThread:0:FinalRequestProcessor@88][] - Processing request::
>> > > > sessionid:0x148b2cd8b010004 type:ping cxid:0xfffffffffffffffe
>> > > > zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
>> > > > 2014-09-26 12:43:50,742 [myid:] - DEBUG
>> > > > [SyncThread:0:FinalRequestProcessor@160][] -
>> > sessionid:0x148b2cd8b010004
>> > > > type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe
>> > txntype:unknown
>> > > > reqpath:n/a
>> > > > 2014-09-26 12:43:55,612 [myid:] - TRACE [ProcessThread(sid:0
>> > > > cport:-1)::ZooTrace@90][] - :Psessionid:0x148b2cd8b010004
>> > > > type:closeSession
>> > > > cxid:0x1 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
>> > > > 2014-09-26 12:43:55,612 [myid:] - INFO  [ProcessThread(sid:0
>> > > > cport:-1)::SessionTrackerImpl@198][] - Session closing:
>> > > 0x148b2cd8b010004
>> > > > 2014-09-26 12:43:55,612 [myid:] - INFO  [ProcessThread(sid:0
>> > > > cport:-1)::PrepRequestProcessor@494][] - Processed session
>> termination
>> > > for
>> > > > sessionid: 0x148b2cd8b010004
>> > > > 2014-09-26 12:43:55,613 [myid:] - DEBUG
>> > > > [SyncThread:0:FinalRequestProcessor@88][] - Processing request::
>> > > > sessionid:0x148b2cd8b010004 type:closeSession cxid:0x1 zxid:0x790
>> > > > txntype:-11 reqpath:n/a
>> > > > 2014-09-26 12:43:55,614 [myid:] - TRACE [SyncThread:0:ZooTrace@90][]
>> -
>> > > > :Esessionid:0x148b2cd8b010004 type:closeSession cxid:0x1 zxid:0x790
>> > > > txntype:-11 reqpath:n/a
>> > > > 2014-09-26 12:43:55,614 [myid:] - TRACE [SyncThread:0:ZooTrace@71][]
>> -
>> > > > SessionTrackerImpl --- Removing session 0x148b2cd8b010004
>> > > > 2014-09-26 12:43:55,614 [myid:] - DEBUG
>> > > > [SyncThread:0:FinalRequestProcessor@160][] -
>> > sessionid:0x148b2cd8b010004
>> > > > type:closeSession cxid:0x1 zxid:0x790 txntype:-11 reqpath:n/a
>> > > > 2014-09-26 12:43:55,614 [myid:] - TRACE
>> [SyncThread:0:NIOServerCnxn@168
>> > > ][]
>> > > > - Add a buffer to outgoingBuffers, sk
>> > > sun.nio.ch.SelectionKeyImpl@28cfdeb5
>> > > > is valid: true
>> > > > 2014-09-26 12:43:55,615 [myid:] - INFO  [NIOServerCxn.Factory:
>> > > > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007][
>> <http://0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007%5D%5B>] - Closed socket
>> connection
>> > > for
>> > > > client /127.0.0.1:58960 which had sessionid 0x148b2cd8b010004
>> > > >
>> > > >
>> > > > I'm assuming that this is due to a misunderstanding on my part.  Can
>> > > anyone
>> > > > point me in the right direction?
>> > > >
>> > > > Much appreciated!
>> > > > - Ian
>> > > >
>> > >
>> >
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message