zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Camille Fournier <cami...@apache.org>
Subject Re: Heartbeats not being received / responded to?
Date Wed, 21 Jan 2015 19:25:46 GMT
Do you have the logs from the zk server? What type of machines are you
running this on? Which version of ZK?

Thanks,
C

On Wed, Jan 21, 2015 at 2:00 PM, Ian Rose <ianrose@fullstory.com> wrote:

> Hi all -
>
> Reviving an old thread here.  We have now been running Zookeeper for a few
> months now and this problem has continued to dog us.  This is entirely a
> localhost problem (when zookeeper and the client app are both running on a
> developer laptop) - thankfully everything has been solid on our production
> servers.  Nonetheless this causes major headaches in our development
> process.  For example, I just now debugged a problem our UX lead was having
> getting our dev stack up and running on his local machine; the problem was
> that our startup script was failing when using zkCli.sh to create a "/solr"
> node in ZK.  This would not be a problem if it was a rare/spurious error,
> but he is getting this error every single time.
>
> For reference, here is the complete stack trace (which appears after
> several seconds - the failure is not immediate):
>
> Exception in thread "main"
> org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /solr
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>     at
> org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:695)
>     at
> org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:588)
>     at
> org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:360)
>     at org.apache.zookeeper.ZooKeeperMain.run(ZooKeeperMain.java:323)
>     at org.apache.zookeeper.ZooKeeperMain.main(ZooKeeperMain.java:282)
>
> Frustratingly, this behavior is very inconsistent.  In fact, after 5+
> failures in a row on our UX lead's machine, now I just heard from him that
> he tried again and it worked.  Could it be that there is some kind of state
> on Zookeeper that needs to timeout and disappear before things will work?
>
> More concretely, the next time this happens, what kind of info should I
> capture to make this more debuggable?
>
> Many thanks,
> Ian
>
>
>
> On Mon, Sep 29, 2014 at 4:51 PM, Ian Rose <ianrose@fullstory.com> wrote:
>
> > Well unfortunately although I am able to repro this problem with a
> > completely new, clean ZK install on my laptop, I *cannot* repro with the
> > same on my coworkers laptop.  So unfortunately I am forced to conclude
> that
> > there is something strange going on locally.
> >
> > Thanks for the help anyhow!
> >
> > - Ian
> >
> >
> > On Mon, Sep 29, 2014 at 9:40 AM, Camille Fournier <camille@apache.org>
> > wrote:
> >
> >> No that is not expected. Odd that you get disconnected once and then
> >> reconnect fine. Does the same thing happen in your kazoo clients, one
> >> disconnect but then the second connect is ok?
> >> Which version of ZK are you running? Are you running this with some sort
> >> of
> >> auth or password to the zk server?
> >>
> >> Thanks,
> >> C
> >>
> >> On Mon, Sep 29, 2014 at 9:24 AM, Ian Rose <ianrose@fullstory.com>
> wrote:
> >>
> >> > Perhaps I'm simply misunderstanding what the expected behavior would
> be.
> >> > Why would my client by disconnected?  Does zookeeper drop idle
> clients?
> >> > And note that this isn't a spurious disconnect; my client is *always*
> >> > dropped
> >> > at that time.
> >> >
> >> > On Friday zkCli seemed to be working just fine for me, but now I am
> >> getting
> >> > disconnections similar to my kazoo-based client.
> >> >
> >> > Here is a session showing `ls /' failing.  This behavior is
> reproducible
> >> > for me currently.
> >> >
> >> > $ ./bin/zkCli.sh
> >> > > Connecting to localhost:2181
> >> > > Welcome to ZooKeeper!
> >> > > JLine support is enabled
> >> > > [zk: localhost:2181(CONNECTING) 0]
> >> > > WATCHER::
> >> > > WatchedEvent state:SyncConnected type:None path:null
> >> > > [zk: localhost:2181(CONNECTED) 0]
> >> > > [zk: localhost:2181(CONNECTED) 0] ls /
> >> > > WATCHER::
> >> > > WatchedEvent state:Disconnected type:None path:null
> >> > > Exception in thread "main"
> >> > > org.apache.zookeeper.KeeperException$ConnectionLossException:
> >> > > KeeperErrorCode = ConnectionLoss for /
> >> > > at
> >> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> >> > > at
> >> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >> > > at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
> >> > > at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
> >> > > at
> >> >
> org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:720)
> >> > > at
> >> org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:588)
> >> > > at
> >> org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:360)
> >> > > at org.apache.zookeeper.ZooKeeperMain.run(ZooKeeperMain.java:323)
> >> > > at org.apache.zookeeper.ZooKeeperMain.main(ZooKeeperMain.java:282)
> >> >
> >> >
> >> > And here is a session where after connecting I just sit and wait for a
> >> bit
> >> > (without sending any commands), whereupon I get disconnected and then
> >> > immediately reconnected.  Interestingly, after the reconnecting I am
> >> able
> >> > to issue multiple commands without any issues.  This behavior is
> >> > reproducible for me currently.
> >> >
> >> > $ ./bin/zkCli.sh
> >> > > Connecting to localhost:2181
> >> > > Welcome to ZooKeeper!
> >> > > JLine support is enabled
> >> > > [zk: localhost:2181(CONNECTING) 0]
> >> > > [zk: localhost:2181(CONNECTING) 0]
> >> > > WATCHER::
> >> > > WatchedEvent state:SyncConnected type:None path:null
> >> > > [zk: localhost:2181(CONNECTED) 0]
> >> > > WATCHER::
> >> > > WatchedEvent state:Disconnected type:None path:null
> >> > > WATCHER::
> >> > > WatchedEvent state:SyncConnected type:None path:null
> >> > > [zk: localhost:2181(CONNECTED) 0]
> >> > > [zk: localhost:2181(CONNECTED) 0] ls /
> >> > > [zookeeper]
> >> > > [zk: localhost:2181(CONNECTED) 1]
> >> > > [zk: localhost:2181(CONNECTED) 1]
> >> > > [zk: localhost:2181(CONNECTED) 1] ls /zookeeper
> >> > > [quota]
> >> > > [zk: localhost:2181(CONNECTED) 2] ls /zookeeper/quota
> >> > > []
> >> >
> >> >
> >> >
> >> > Thanks,
> >> > - Ian
> >> >
> >> >
> >> > On Sun, Sep 28, 2014 at 2:22 PM, Camille Fournier <camille@apache.org
> >
> >> > wrote:
> >> >
> >> > > Sorry but with what you've sent us I don't really see what the
> problem
> >> > is.
> >> > > It does look like you connect and then nothing happens for 20s and
> >> then
> >> > the
> >> > > connection is dropped. If you use the zkCli script to connect via
> the
> >> > > command line do you see the same problem?
> >> > >
> >> > > C
> >> > >
> >> > > On Fri, Sep 26, 2014 at 12:48 PM, Ian Rose <ianrose@fullstory.com>
> >> > wrote:
> >> > >
> >> > > > Hi all -
> >> > > >
> >> > > > I've just gotten started using SolrCloud, which uses Zookeeper
for
> >> > > > coordination; I am otherwise completely new to Zookeeper.  Now
I
> am
> >> > > trying
> >> > > > to query Zookeeper directly for some simple information.  I am
> >> finding,
> >> > > > however, that although my clients are able to connect they very
> >> > > frequently
> >> > > > receive timeouts.  It almost seems like the server isn't receiving
> >> the
> >> > > > heartbeat messages at all (or isn't responding to them).  I've
> seen
> >> > > similar
> >> > > > behavior both when using the go-zookeeper
> >> > > > <https://godoc.org/github.com/samuel/go-zookeeper/zk> and
kazoo
> >> > > > <https://github.com/python-zk/kazoo> (python) client libraries
(I
> >> > wanted
> >> > > > to
> >> > > > try >1 to ensure that it wasn't a client lib problem).
> >> > > >
> >> > > > My config is very simple: I am running the client and a single
> >> > Zookeeper
> >> > > > node on the same machine (my laptop).  There are no other clients
> of
> >> > the
> >> > > > Zookeeper node while I am running these tests, so there is no
> >> practical
> >> > > > possibility that the JVM is overloaded or GCing.
> >> > > >
> >> > > > Here is a very basic kazoo client that I am using.  Obviously
it
> >> isn't
> >> > > > doing any "real" work right now - this is just to demonstrate
the
> >> > > > disconnects.
> >> > > >
> >> > > > #!/usr/bin/env python2.7
> >> > > >
> >> > > > from kazoo.client import KazooClient, KazooState
> >> > > > import logging
> >> > > > import time
> >> > > >
> >> > > > def my_listener(state):
> >> > > >   if state == KazooState.LOST:
> >> > > >     # Register somewhere that the session was lost
> >> > > >     logging.warning('handle lost')
> >> > > >   elif state == KazooState.SUSPENDED:
> >> > > >     # Handle being disconnected from Zookeeper
> >> > > >     logging.debug('handle being disconnected')
> >> > > >   else:
> >> > > >     # Handle being connected/reconnected to Zookeeper
> >> > > >     logging.debug('handle being (re)connected')
> >> > > >
> >> > > >
> >> > > > if __name__ == '__main__':
> >> > > >   logging.basicConfig(format='%(asctime)-15s %(levelname)s
> >> > %(message)s',
> >> > > > level=logging.DEBUG)
> >> > > >
> >> > > >   logging.debug('starting...')
> >> > > >   zk = KazooClient(hosts='127.0.0.1:2181', timeout=30)
> >> > > >   zk.start()
> >> > > >   zk.add_listener(my_listener)
> >> > > >
> >> > > >   time.sleep(35)
> >> > > >
> >> > > >   zk.stop()
> >> > > >
> >> > > > --------------------
> >> > > >
> >> > > > Here is my zookeeper config:
> >> > > >
> >> > > > # The number of milliseconds of each tick
> >> > > > tickTime=2000
> >> > > >
> >> > > > # The number of ticks that the initial
> >> > > > # synchronization phase can take
> >> > > > initLimit=10
> >> > > >
> >> > > > # The number of ticks that can pass between
> >> > > > # sending a request and getting an acknowledgement
> >> > > > syncLimit=5
> >> > > >
> >> > > > # the directory where the snapshot is stored.
> >> > > > dataDir=/Users/ianrose/Code/zookeeper/var/data
> >> > > >
> >> > > > # the port at which the clients will connect
> >> > > > clientPort=2181
> >> > > >
> >> > > > # The number of snapshots to retain in dataDir
> >> > > > autopurge.snapRetainCount=5
> >> > > >
> >> > > > # Purge task interval in hours
> >> > > > # Set to "0" to disable auto purge feature
> >> > > > autopurge.purgeInterval=1
> >> > > >
> >> > > >
> >> > > > --------------------
> >> > > >
> >> > > >
> >> > > > Here is the output I get on the client:
> >> > > >
> >> > > > 2014-09-26 12:43:20,603 DEBUG starting...
> >> > > > 2014-09-26 12:43:20,604 INFO Connecting to 127.0.0.1:2181
> >> > > > 2014-09-26 12:43:20,605 DEBUG Sending request(xid=None):
> >> > > > Connect(protocol_version=0, last_zxid_seen=0, time_out=30000,
> >> > > session_id=0,
> >> > > >
> >> > >
> >> >
> >>
> passwd='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
> >> > > > read_only=None)
> >> > > > 2014-09-26 12:43:20,609 INFO Zookeeper connection established,
> >> state:
> >> > > > CONNECTED
> >> > > > 2014-09-26 12:43:40,222 WARNING Connection dropped: outstanding
> >> > heartbeat
> >> > > > ping not received
> >> > > > 2014-09-26 12:43:40,222 WARNING Transition to CONNECTING
> >> > > > 2014-09-26 12:43:40,222 INFO Zookeeper connection lost
> >> > > > 2014-09-26 12:43:40,222 DEBUG handle being disconnected
> >> > > > 2014-09-26 12:43:40,747 INFO Connecting to 127.0.0.1:2181
> >> > > > 2014-09-26 12:43:40,748 DEBUG Sending request(xid=None):
> >> > > > Connect(protocol_version=0, last_zxid_seen=0, time_out=30000,
> >> > > > session_id=92520388231233540,
> >> > > > passwd='\xd1M\xb9\xb3\xae\xab\xa1!@x\x06nv\xb7\xe3*',
> >> read_only=None)
> >> > > > 2014-09-26 12:43:40,750 INFO Zookeeper connection established,
> >> state:
> >> > > > CONNECTED
> >> > > > 2014-09-26 12:43:40,751 DEBUG handle being (re)connected
> >> > > > 2014-09-26 12:43:55,611 DEBUG Sending request(xid=1): Close()
> >> > > > 2014-09-26 12:43:55,614 INFO Closing connection to 127.0.0.1:2181
> >> > > > 2014-09-26 12:43:55,615 INFO Zookeeper session lost, state: CLOSED
> >> > > > 2014-09-26 12:43:55,615 WARNING handle lost
> >> > > >
> >> > > >
> >> > > > And here is trace-level logging from the server side:
> >> > > >
> >> > > > 2014-09-26 12:43:20,605 [myid:] - INFO  [NIOServerCxn.Factory:
> >> > > > 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197][
> >> <http://0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197%5D%5B>] -
> Accepted
> >> socket
> >> > > > connection from /127.0.0.1:58959
> >> > > > 2014-09-26 12:43:20,605 [myid:] - DEBUG [NIOServerCxn.Factory:
> >> > > > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@810][
> >> <http://0.0.0.0/0.0.0.0:2181:ZooKeeperServer@810%5D%5B>] - Session
> >> establishment
> >> > > > request
> >> > > > from client /127.0.0.1:58959 client's lastZxid is 0x0
> >> > > > 2014-09-26 12:43:20,605 [myid:] - INFO  [NIOServerCxn.Factory:
> >> > > > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868][
> >> <http://0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868%5D%5B>] - Client
> >> attempting to
> >> > > > establish new session at /127.0.0.1:58959
> >> > > > 2014-09-26 12:43:20,605 [myid:] - TRACE [NIOServerCxn.Factory:
> >> > > > 0.0.0.0/0.0.0.0:2181:ZooTrace@71][
> >> <http://0.0.0.0/0.0.0.0:2181:ZooTrace@71%5D%5B>] - SessionTrackerImpl
> >> --- Adding
> >> > > > session
> >> > > > 0x148b2cd8b010004 30000
> >> > > > 2014-09-26 12:43:20,606 [myid:] - TRACE [ProcessThread(sid:0
> >> > > > cport:-1)::ZooTrace@90][] - :Psessionid:0x148b2cd8b010004
> >> > > > type:createSession cxid:0x0 zxid:0xfffffffffffffffe
> txntype:unknown
> >> > > > reqpath:n/a
> >> > > > 2014-09-26 12:43:20,606 [myid:] - TRACE [ProcessThread(sid:0
> >> > > > cport:-1)::ZooTrace@71][] - SessionTrackerImpl --- Existing
> session
> >> > > > 0x148b2cd8b010004 30000
> >> > > > 2014-09-26 12:43:20,608 [myid:] - DEBUG
> >> > > > [SyncThread:0:FinalRequestProcessor@88][] - Processing request::
> >> > > > sessionid:0x148b2cd8b010004 type:createSession cxid:0x0 zxid:0x78f
> >> > > > txntype:-10 reqpath:n/a
> >> > > > 2014-09-26 12:43:20,608 [myid:] - TRACE [SyncThread:0:ZooTrace@90
> ][]
> >> -
> >> > > > :Esessionid:0x148b2cd8b010004 type:createSession cxid:0x0
> zxid:0x78f
> >> > > > txntype:-10 reqpath:n/a
> >> > > > 2014-09-26 12:43:20,608 [myid:] - TRACE [SyncThread:0:ZooTrace@71
> ][]
> >> -
> >> > > > SessionTrackerImpl --- Existing session 0x148b2cd8b010004 30000
> >> > > > 2014-09-26 12:43:20,608 [myid:] - DEBUG
> >> > > > [SyncThread:0:FinalRequestProcessor@160][] -
> >> > sessionid:0x148b2cd8b010004
> >> > > > type:createSession cxid:0x0 zxid:0x78f txntype:-10 reqpath:n/a
> >> > > > 2014-09-26 12:43:20,609 [myid:] - INFO
> >> > [SyncThread:0:ZooKeeperServer@617
> >> > > > ][]
> >> > > > - Established session 0x148b2cd8b010004 with negotiated timeout
> >> 30000
> >> > for
> >> > > > client /127.0.0.1:58959
> >> > > > 2014-09-26 12:43:40,748 [myid:] - INFO  [NIOServerCxn.Factory:
> >> > > > 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197][
> >> <http://0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197%5D%5B>] -
> Accepted
> >> socket
> >> > > > connection from /127.0.0.1:58960
> >> > > > 2014-09-26 12:43:40,749 [myid:] - DEBUG [NIOServerCxn.Factory:
> >> > > > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@810][
> >> <http://0.0.0.0/0.0.0.0:2181:ZooKeeperServer@810%5D%5B>] - Session
> >> establishment
> >> > > > request
> >> > > > from client /127.0.0.1:58960 client's lastZxid is 0x0
> >> > > > 2014-09-26 12:43:40,749 [myid:] - INFO  [NIOServerCxn.Factory:
> >> > > > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861][
> >> <http://0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861%5D%5B>] - Client
> >> attempting to
> >> > renew
> >> > > > session 0x148b2cd8b010004 at /127.0.0.1:58960
> >> > > > 2014-09-26 12:43:40,749 [myid:] - INFO  [NIOServerCxn.Factory:
> >> > > > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007][
> >> <http://0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007%5D%5B>] - Closed socket
> >> connection
> >> > > for
> >> > > > client /127.0.0.1:58959 which had sessionid 0x148b2cd8b010004
> >> > > > 2014-09-26 12:43:40,750 [myid:] - TRACE [NIOServerCxn.Factory:
> >> > > > 0.0.0.0/0.0.0.0:2181:ZooTrace@71][
> >> <http://0.0.0.0/0.0.0.0:2181:ZooTrace@71%5D%5B>] - Session
> >> 0x148b2cd8b010004 is
> >> > > valid:
> >> > > > true
> >> > > > 2014-09-26 12:43:40,750 [myid:] - INFO  [NIOServerCxn.Factory:
> >> > > > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@617][
> >> <http://0.0.0.0/0.0.0.0:2181:ZooKeeperServer@617%5D%5B>] - Established
> >> session
> >> > > > 0x148b2cd8b010004 with negotiated timeout 30000 for client /
> >> > > > 127.0.0.1:58960
> >> > > > 2014-09-26 12:43:50,742 [myid:] - DEBUG
> >> > > > [SyncThread:0:FinalRequestProcessor@88][] - Processing request::
> >> > > > sessionid:0x148b2cd8b010004 type:ping cxid:0xfffffffffffffffe
> >> > > > zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
> >> > > > 2014-09-26 12:43:50,742 [myid:] - DEBUG
> >> > > > [SyncThread:0:FinalRequestProcessor@160][] -
> >> > sessionid:0x148b2cd8b010004
> >> > > > type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe
> >> > txntype:unknown
> >> > > > reqpath:n/a
> >> > > > 2014-09-26 12:43:55,612 [myid:] - TRACE [ProcessThread(sid:0
> >> > > > cport:-1)::ZooTrace@90][] - :Psessionid:0x148b2cd8b010004
> >> > > > type:closeSession
> >> > > > cxid:0x1 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
> >> > > > 2014-09-26 12:43:55,612 [myid:] - INFO  [ProcessThread(sid:0
> >> > > > cport:-1)::SessionTrackerImpl@198][] - Session closing:
> >> > > 0x148b2cd8b010004
> >> > > > 2014-09-26 12:43:55,612 [myid:] - INFO  [ProcessThread(sid:0
> >> > > > cport:-1)::PrepRequestProcessor@494][] - Processed session
> >> termination
> >> > > for
> >> > > > sessionid: 0x148b2cd8b010004
> >> > > > 2014-09-26 12:43:55,613 [myid:] - DEBUG
> >> > > > [SyncThread:0:FinalRequestProcessor@88][] - Processing request::
> >> > > > sessionid:0x148b2cd8b010004 type:closeSession cxid:0x1 zxid:0x790
> >> > > > txntype:-11 reqpath:n/a
> >> > > > 2014-09-26 12:43:55,614 [myid:] - TRACE [SyncThread:0:ZooTrace@90
> ][]
> >> -
> >> > > > :Esessionid:0x148b2cd8b010004 type:closeSession cxid:0x1
> zxid:0x790
> >> > > > txntype:-11 reqpath:n/a
> >> > > > 2014-09-26 12:43:55,614 [myid:] - TRACE [SyncThread:0:ZooTrace@71
> ][]
> >> -
> >> > > > SessionTrackerImpl --- Removing session 0x148b2cd8b010004
> >> > > > 2014-09-26 12:43:55,614 [myid:] - DEBUG
> >> > > > [SyncThread:0:FinalRequestProcessor@160][] -
> >> > sessionid:0x148b2cd8b010004
> >> > > > type:closeSession cxid:0x1 zxid:0x790 txntype:-11 reqpath:n/a
> >> > > > 2014-09-26 12:43:55,614 [myid:] - TRACE
> >> [SyncThread:0:NIOServerCnxn@168
> >> > > ][]
> >> > > > - Add a buffer to outgoingBuffers, sk
> >> > > sun.nio.ch.SelectionKeyImpl@28cfdeb5
> >> > > > is valid: true
> >> > > > 2014-09-26 12:43:55,615 [myid:] - INFO  [NIOServerCxn.Factory:
> >> > > > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007][
> >> <http://0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007%5D%5B>] - Closed socket
> >> connection
> >> > > for
> >> > > > client /127.0.0.1:58960 which had sessionid 0x148b2cd8b010004
> >> > > >
> >> > > >
> >> > > > I'm assuming that this is due to a misunderstanding on my part.
> Can
> >> > > anyone
> >> > > > point me in the right direction?
> >> > > >
> >> > > > Much appreciated!
> >> > > > - Ian
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message