hadoop-zookeeper-user mailing list archives

From Benjamin Reed <br...@yahoo-inc.com>
Subject Re: Membership using ZK
Date Tue, 12 Oct 2010 17:45:18 GMT
ZooKeeper considers a client dead when it hasn't heard from that
client during the timeout period. Clients make sure to communicate with
ZooKeeper at least once every 1/3 of the timeout period. If the client
doesn't hear from ZooKeeper within 2/3 of the timeout period, the client
will issue a ConnectionLoss event and cause outstanding requests to fail
with a ConnectionLoss error.
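
To make the timing concrete, here is a rough Python sketch of those three
intervals. It is only a model of the rule as described, not ZooKeeper
code: the function names and the 30-second timeout are made up for
illustration, and the exact boundary handling (>=) is an assumption.

```python
def session_deadlines(timeout_ms):
    """Return the three intervals from the description, in milliseconds."""
    return {
        # The client contacts ZooKeeper at least this often.
        "heartbeat_every": timeout_ms // 3,
        # The client reports ConnectionLoss after this much server silence.
        "connection_loss_after": 2 * timeout_ms // 3,
        # The server declares the client dead after this much client silence.
        "expiry_after": timeout_ms,
    }


def client_view(timeout_ms, ms_since_heard_from_server):
    """What the client concludes after not hearing from ZooKeeper."""
    limit = session_deadlines(timeout_ms)["connection_loss_after"]
    if ms_since_heard_from_server >= limit:
        return "ConnectionLoss"
    return "Connected"


def server_view(timeout_ms, ms_since_heard_from_client):
    """What ZooKeeper concludes after not hearing from the client."""
    if ms_since_heard_from_client >= timeout_ms:
        return "Expired"
    return "Alive"


if __name__ == "__main__":
    # Hypothetical 30 s session timeout.
    print(session_deadlines(30_000))
    # 25 s of silence: the client already sees ConnectionLoss, but the
    # server has not expired the session yet.
    print(client_view(30_000, 25_000), server_view(30_000, 25_000))
```

Note the window between 2/3 of the timeout and the full timeout: the
client already knows it is in trouble before ZooKeeper gives up on it.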

So, if ZooKeeper decides a process is dead, the process will get a 
ConnectionLoss event. Once ZooKeeper decides that a client is dead, if 
the client reconnects, the client will get a SessionExpired. Once a 
session is expired, the expired handle will become useless, so no new 
requests, no watches, etc.
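
The sequence above can be written down as a tiny state machine. This is
a sketch of the behaviour as I described it, not the client library: the
event names ("silence", "reconnect_in_time", "reconnect_after_expiry")
are hypothetical labels chosen for the example.

```python
from enum import Enum


class State(Enum):
    CONNECTED = "Connected"
    CONNECTION_LOSS = "ConnectionLoss"
    SESSION_EXPIRED = "SessionExpired"


def next_state(state, event):
    """Advance the client-side view of a session by one event."""
    if state is State.SESSION_EXPIRED:
        # Terminal: an expired handle never becomes usable again.
        return State.SESSION_EXPIRED
    if event == "silence":
        # Nothing heard from ZooKeeper for 2/3 of the timeout.
        return State.CONNECTION_LOSS
    if event == "reconnect_in_time":
        # Reconnected before ZooKeeper declared the client dead.
        return State.CONNECTED
    if event == "reconnect_after_expiry":
        # Reconnected after ZooKeeper declared the client dead.
        return State.SESSION_EXPIRED
    raise ValueError(f"unknown event: {event!r}")


def handle_usable(state):
    """An expired handle accepts no new requests and no watches."""
    return state is not State.SESSION_EXPIRED


if __name__ == "__main__":
    s = State.CONNECTED
    for event in ("silence", "reconnect_after_expiry", "reconnect_in_time"):
        s = next_state(s, event)
        print(event, "->", s.value, "usable:", handle_usable(s))
```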

The bottom line is that if your process gets a SessionExpired, you need
to treat that process as expired and recover by creating a new ZooKeeper
handle (possibly by restarting the process) and setting up your state
again.
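
For the membership case, the recovery could look roughly like the sketch
below. Everything in it is a stand-in: FakeEnsemble, FakeHandle and
Member are hypothetical in-memory classes written for this example, not
the ZooKeeper client API. The point is only the shape of the recovery:
throw away the expired handle, get a new one, and recreate the ephemeral
entry under /Membership.

```python
class SessionExpired(Exception):
    """Raised when an expired handle is used (stand-in for the real error)."""


class FakeEnsemble:
    """In-memory stand-in for a ZooKeeper ensemble (hypothetical)."""

    def __init__(self):
        self._next_session = 1
        self.expired = set()   # session ids declared dead
        self.ephemerals = {}   # path -> owning session id

    def connect(self):
        handle = FakeHandle(self, self._next_session)
        self._next_session += 1
        return handle

    def expire(self, session_id):
        # Expiring a session deletes every ephemeral node it owns.
        self.expired.add(session_id)
        self.ephemerals = {
            path: owner
            for path, owner in self.ephemerals.items()
            if owner != session_id
        }


class FakeHandle:
    def __init__(self, ensemble, session_id):
        self.ensemble = ensemble
        self.session_id = session_id

    def create_ephemeral(self, path):
        if self.session_id in self.ensemble.expired:
            raise SessionExpired(path)
        self.ensemble.ephemerals[path] = self.session_id


class Member:
    """A process that advertises itself under /Membership."""

    def __init__(self, ensemble, name):
        self.ensemble = ensemble
        self.path = f"/Membership/{name}"
        self.handle = None
        self.join()

    def join(self):
        # Fresh handle, then redo all session-bound setup.
        self.handle = self.ensemble.connect()
        self.handle.create_ephemeral(self.path)

    def on_session_expired(self):
        # The old handle is useless; do not retry on it.
        self.join()


if __name__ == "__main__":
    ensemble = FakeEnsemble()
    member = Member(ensemble, "A")
    ensemble.expire(member.handle.session_id)   # false detection
    print(sorted(ensemble.ephemerals))          # entry is gone
    member.on_session_expired()
    print(sorted(ensemble.ephemerals))          # entry is back
```

In a real client, any watches the old session had set would also need to
be set again on the new handle as part of the same setup step.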


On 10/12/2010 09:54 AM, Avinash Lakshman wrote:
> This is what I have going:
> I have a bunch of 200 nodes come up and create an ephemeral entry under a
> znode named /Membership. When nodes are detected dead the node associated
> with the dead node under /Membership is deleted and a watch is delivered to
> the rest of the members. Now there are circumstances where a node A is deemed dead
> while the process is still up and running on A. It is a false detection
> which I need to probably deal with. How do I deal with this situation?  Over
> time false detections delete all the entries underneath the /Membership
> znode even though all processes are up and running.
> So my questions are:
> Would the watches be pushed out to the node that is falsely deemed dead? If
> so I can have that process recreate the ephemeral znode underneath
> /Membership.
> If a node leaves a watch and then truly crashes, when it comes back up would
> it get watches it missed during the interim period? In any case how do
> watches behave in the event of false/true failure detection?
> Thanks
> A
