Hi Rakesh, Thank you so much for the reply. The client program that I have used is a simple one which implements the Leader Election. It creates a ephemeral node and watches it and once that is deleted a new leader is selected. Client Code https://gist.github.com/tharindukumara/025c98ddbd406db34067a587c6d9533d For this test, I deployed both the client and Server C in the same machine. I think, you can also recreate the scenario, if you deploy the zookeeper servers and clients as I mentioned. This is the output of a gstack while executing the client program. Thread 3 (Thread 0x7fa5a7a2d700 (LWP 9057)): #0 0x00007fa5a7d27c1d in poll () from /lib64/libc.so.6 #1 0x00007fa5a88a34d8 in copy_watchers (clone=, to=0x118e3b0, from=) at src/zk_hashtable.c:233 #2 add_for_event (ht=, path=, list=) at src/zk_hashtable.c:265 #3 0x00007fa5a7c460f0 in __sched_getparam@got.plt () from /lib64/libpthread.so.0 #4 0x0000000000000000 in ?? () Thread 2 (Thread 0x7fa5a722c700 (LWP 9058)): #0 0x00007fa5a7a3a05f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fa5a88a3686 in destroy_zk_hashtable (ht=0x118e514) at src/zk_hashtable.c:143 #2 0x000000000118e3b0 in ?? () #3 0x0000000000000000 in ?? () Thread 1 (Thread 0x7fa5a8cad740 (LWP 9056)): #0 0x00007fa5a7d01ded in nanosleep () from /lib64/libc.so.6 #1 0x00007fa5a7d01c84 in sleep () from /lib64/libc.so.6 #2 0x000000000040213c in main () Can you please give me a proper reason for this scenario? To me it looks like when the client connects to server B, it initiates a blocking call to Server B. And since the port is open to communicate it waits on it too much time. Thanks Tharindu On Mon, Mar 6, 2017 at 12:00 PM, Rakesh Radhakrishnan wrote: > Sorry, it was a busy weekend for me. I hope the following will give more > clarity. > > (1) Client connected to a server and sends heartbeat periodically > Send pings 1/3 the session timeout > https://github.com/apache/zookeeper/blob/branch-3.4/src/c/ > src/zookeeper.c#L1675 > In your case, sessiontimeout=45secs, then approax 15secs is the ping > interval > > (2) Client connected to a server and the connected server is unavailable. > Missing ping response. Usually, this time period could be (ping interval + > small grace period, for idle time calc). > This allows time to reconnect to a different server. > > In your case, timeout 45 secs, client took 15 secs to get the disconnection > event from server C. > Then, it took 30 secs to hit connection error from server B. This is very > suspicious and ideally client should receive an error within 1/3 timeout + > small grace period. > Any chance to check the status of the client machine during this 30 secs > time period. Do you think Kernel message can help ? > > Below is, the session "0x35a926acae80000" related log from the shared > client log. > >>> zookeeper_client: 2017-03-03 > 09:56:14,770:23078(0x7f9467786700):ZOO_INFO@check_events@1775: session > establishment complete on server [172.25.83.205:2181], > sessionId=0x35a926acae80000, negotiated timeout=45000 > >>> zookeeper_client: 2017-03-03 > 09:56:30,770:23078(0x7f9467786700):ZOO_DEBUG@zookeeper_process@2218: Got > ping response in 1 ms > >>> zookeeper_client: 2017-03-03 > 09:56:42,213:23078(0x7f9467786700):ZOO_ERROR@handle_socket_error_msg@1746: > Socket [172.25.83.205:2181] zk retcode=-4, errno=112(Host is down): failed > while receiving aserver response > >>> zookeeper_client: 2017-03-03 > 09:57:12,220:23078(0x7f9467786700):ZOO_ERROR@handle_socket_error_msg@1666: > Socket [172.25.83.204:2181] zk retcode=-7, errno=110(Connection timed > out): > connection to 172.25.83.204:2181 timed out (exceeded timeout by 6ms) > >>> zookeeper_client: 2017-03-03 > 09:57:12,221:23078(0x7f9467786700):ZOO_INFO@check_events@1728: initiated > connection to server [172.25.83.201:2181] > >>> zookeeper_client: 2017-03-03 > 09:57:12,223:23078(0x7f9467786700):ZOO_ERROR@handle_socket_error_msg@1764: > Socket [172.25.83.201:2181] zk retcode=-112, errno=116(Stale file handle): > sessionId=0x35a926acae80000 has expired. > > Rakesh > > On Mon, Mar 6, 2017 at 10:11 AM, Tharindu Kumara < > zonik.hatkumara@gmail.com> > wrote: > > > Hi Rakesh, > > > > Any updates on this? > > > > Is this a bug or an expected situation? > > > > Thanks > > > > > > > > On Fri, Mar 3, 2017 at 10:49 AM, Tharindu Kumara < > > zonik.hatkumara@gmail.com> > > wrote: > > > > > Hi Rakesh, > > > > > > I have made a small mistake in the above email. > > > For client's session timeout I have used 20000ms not 45000ms. > > > Sorry for the mistake. > > > Basically it waits for 28 seconds to connect to Server A. (14 seconds > to > > > figure out disconnection and another 14 seconds to connect to Server B) > > > > > > > > > > > > > > > Then as soon as it connected to Server A, Server A sends the session > > > expiration. > > > Then I again did the same experiment with setting client's session > > > expiration time to 45000ms. > > > Then it took 15secs to retrieve the disconnection event to client. > > > After from that point, client took another 30secs to trying to connect > to > > > Server B. > > > And then after that, client figures out that it cannot connect to > Server > > B > > > and try to initiate a connection to Server A. > > > Since 15 + 30 secs is 45, when as soon as client connected to Server A, > > > Server A sends the session expiration event. > > > What is the reason for that? It looks like client is trying too much > time > > > to connect to Server B. > > > > > > > > > Here I attached the Server B's log (Leader) and client's log. > > > Server A = 172.25.83.201 > > > Server B = 172.25.83.204 > > > Server C = 172.25.83.205 > > > And client's session id is 0x35a926acae80000 > > > > > > https://drive.google.com/file/d/0B9o9GJ_CoG1_ > > > NWRTQ3hmeGpZRktrc040MkZzTk5LTDdpMi1v/view?usp=sharing > > > https://drive.google.com/file/d/0B9o9GJ_CoG1_ > > > VHlKa092QzB6M3NPemRORk9Lc0w1a2N5ZDdz/view?usp=sharing > > > > > > And also in the above post, you mentioned that, you mentioned about a > > > connection-timeout. > > > Can you please explain a bit more? > > > > > > I see that when I try with both 45000ms and 20000ms, it takes around > 14 - > > > 15 seconds to figure out a disconnection from clients end. > > > Looks like it is a constant and client has (session timeout - > connection > > > loss notified time) to find a new Server. Am I correct here? > > > You mentioned that client connection timeout is equal to > > > ("sessiontimeout/listed servers count"). > > > Please explain a bit about the client ping times and connection > timeout. > > > How these to co-relate? > > > > > > On Thu, Mar 2, 2017 at 8:18 PM, Rakesh Radhakrishnan < > rakeshr@apache.org > > > > > > wrote: > > > > > >> >>> You mentioned that a client sends a ping every 1/3 the session > > >> timeout. > > >> Yes, you are correct. Again to analyse your issue, we have to consider > > >> re-connection timeout also, which is "sessiontimeout/listed servers > > count" > > >> https://github.com/apache/zookeeper/blob/branch-3.4/src/java > > >> /main/org/apache/zookeeper/ClientCnxn.java#L1292 > > >> https://github.com/apache/zookeeper/blob/branch-3.4/src/java > > >> /main/org/apache/zookeeper/ClientCnxn.java#L1098 > > >> > > >> Coincidentally, in your example both heartbeat interval and > > re-connection > > >> interval are same as you have three servers. > > >> > > >> >>>>> It looks like C3 has taken 14 seconds to determine the > > disconnected > > >> event > > >> >>>>> and another 14 seconds to that it cannot connect to Server B(C3 > is > > >> isolated > > >> >>>>> from B). > > >> > > >> With this info, total elapsed time is 28 secs which is less than 45 > secs > > >> session timeout. Now, the client has 17 secs (45 secs - 28 secs) time > > >> period to re-establish a connection with server A, right? Could you > > please > > >> check whether the client is connecting to A during this period? > > >> > > >> Rakesh > > >> > > >> > > >> > > >> On Thu, Mar 2, 2017 at 6:58 PM, Tharindu Kumara < > > >> zonik.hatkumara@gmail.com> > > >> wrote: > > >> > > >> > ​​ > > >> > Hi Rakesh, > > >> > > > >> > First of all thank you for the quick reply. > > >> > > > >> > >>>>> Actually, ZooKeeper client has retry mechanism. > > >> > >>>>> Client sends a ping every 1/3 the session timeout (here, 3 is > > the > > >> no. > > >> > of listed servers, A, B, C) and then looks for a response before > > another > > >> > 1/3 elapses. This allows time to reconnect to a different server > (and > > >> still > > >> > maintain the session) if the connected server becomes unavailable. > > >> > > > >> > You mentioned that a client sends a ping every 1/3 the session > > timeout. > > >> And > > >> > 3 is the no of listed servers. > > >> > > > >> > I doubt that. Because, I am using the C Binding and after inspecting > > the > > >> > code it looks like that 3 is a hard coded value. > > >> > Simply no matter what the number of clients, zk client biding is > > always > > >> > sending a ping every 1/3 session timeout. > > >> > > > >> > Can please clarify that for me? > > >> > > > >> > Here I used a tick of 3000ms and session expiration timeout of > > 45000ms. > > >> > > > >> > And please find the screenshot of extacted client log outout. > > >> > > > >> > https://anonimag.es/image/JT9htnL > > >> > > > >> > It looks like C3 has taken 14 seconds to determine the disconnected > > >> event > > >> > and another 14 seconds to that it cannot connect to Server B(C3 is > > >> isolated > > >> > from B). > > >> > > > >> > > > >> > > > >> > On Thu, Mar 2, 2017 at 4:08 PM, Rakesh Radhakrishnan < > > >> rakeshr@apache.org> > > >> > wrote: > > >> > > > >> > > >>>> According to my understanding, it looks like, when a client > > >> trying > > >> > to > > >> > > >>>> connect to a server that it cannot connect due to a network > > >> > > partitioning, > > >> > > >>>> it uses a blocking call and it waits too much time trying to > > >> > > >>>> connect to a server that it cannot communicate. > > >> > > > > >> > > Actually, ZooKeeper client has retry mechanism. > > >> > > Client sends a ping every 1/3 the session timeout (here, 3 is the > > no. > > >> of > > >> > > listed servers, A, B, C) > > >> > > and then looks for a response before another 1/3 elapses. This > > allows > > >> > time > > >> > > to reconnect to a > > >> > > different server (and still maintain the session) if the connected > > >> server > > >> > > becomes unavailable. > > >> > > > > >> > > Could you grep the following log message in your client log and > tell > > >> me > > >> > how > > >> > > much time C3 taken for the re-connection attempts. > > >> > > "Client session timed out, have not heard from server in " > > >> > > > > >> > > C3 might have first attempted to reconnect to B and then A. Also, > > >> need to > > >> > > check how much time C3 taken to detect connection failure from > > server > > >> C. > > >> > > > > >> > > Could you please share the zk client log to dig more. > > >> > > > > >> > > Rakesh > > >> > > > > >> > > > > >> > > On Thu, Mar 2, 2017 at 11:04 AM, Tharindu Kumara < > > >> > > zonik.hatkumara@gmail.com> > > >> > > wrote: > > >> > > > > >> > > > > ​ > > >> > > > 1) Could you tell me the status of Server C, is this lost > > >> connection to > > >> > > the > > >> > > > > quorum and fails to join quorum continuously as B is the > > >> Leader > > >> > ? > > >> > > > > > >> > > > Yes, B the leader. C Server is completely isolated from the > > >> Leader(B) > > >> > > > and It cannot communicate with the Leader. C cannot continuously > > >> > connect > > >> > > to > > >> > > > the > > >> > > > > > >> > > > Leader. > > >> > > > > > >> > > > > > >> > > > > 2) C3 is connected C. Please tell me the connection host > string > > >> > passed > > >> > > > to > > >> > > > > this client. Does it contains all three servers info > > >> > > "A:clientport, > > >> > > > > B:clientport, C:clientport" ? > > >> > > > > > >> > > > Yes, C3's connection string contains all three servers. > > >> ("A:clientport, > > >> > > > B:clientport, C:clientport") > > >> > > > > > >> > > > > > >> > > > > 3) Please check all three servers and client C3 logs to see > any > > >> > > > > inconsistencies or exceptions. > > >> > > > > > >> > > > After looking at logs, it seems when the server C isolated from > > the > > >> > > Leader, > > >> > > > > > >> > > > a disconnect event fires to client C3. Then it (C3) tries too > much > > >> time > > >> > > to > > >> > > > connect to Server B(Leader) . > > >> > > > > > >> > > > But it cannot connect to server B, as we blocked the connection > > >> between > > >> > > > Server C and > > >> > > > > > >> > > > Server B. Basically, C3 tries more than half of the session > > timeout > > >> > time > > >> > > to > > >> > > > connect to Server B. > > >> > > > > > >> > > > Then after figuring out that C3 cannot to connect to Server B, > it > > >> tries > > >> > > to > > >> > > > connect > > >> > > > > > >> > > > to Server A, and it connects to Server A successfully. But this > is > > >> too > > >> > > > late, because > > >> > > > > > >> > > > session is already expired at the time C3 connected. > > >> > > > > > >> > > > And this happens sometimes only. Because when we specify all the > > >> > servers > > >> > > in > > >> > > > the client's > > >> > > > > > >> > > > connect string, sometimes after C3 disconnecting from Server C, > > >> instead > > >> > > of > > >> > > > trying to connect to > > >> > > > > > >> > > > Server B it connects to Server A as the first attempt. In this > > case > > >> the > > >> > > > client C3 connects to the > > >> > > > > > >> > > > quorum successfully before the session expiration. > > >> > > > > > >> > > > According to my understanding, it looks like, when a client > trying > > >> to > > >> > > > connect to a server that it cannot > > >> > > > > > >> > > > connect due to a network partitioning, it uses a blocking call > and > > >> it > > >> > > waits > > >> > > > too much time trying to > > >> > > > > > >> > > > connect to a server that it cannot communicate. > > >> > > > > > >> > > > > > >> > > > > > >> > > > > 4) ZooKeeper version used in your testing ? > > >> > > > > > >> > > > I used zookeeper 3.4.9 (current stable release) > > >> > > > > > >> > > > > > >> > > > > > >> > > > On Thu, Mar 2, 2017 at 7:48 AM, Rakesh Radhakrishnan < > > >> > rakeshr@apache.org > > >> > > > > > >> > > > wrote: > > >> > > > > > >> > > > > Hi, > > >> > > > > > > >> > > > > Could you please give few more details, > > >> > > > > > > >> > > > > ​​ > > >> > > > > 1) Could you tell me the status of Server C, is this lost > > >> connection > > >> > to > > >> > > > the > > >> > > > > quorum and fails to join quorum continuously as B is the > Leader > > ? > > >> > > > > > > >> > > > > 2) C3 is connected C. Please tell me the connection host > string > > >> > passed > > >> > > to > > >> > > > > this client. Does it contains all three servers info > > >> "A:clientport, > > >> > > > > B:clientport, C:clientport" ? > > >> > > > > > > >> > > > > 3) Please check all three servers and client C3 logs to see > any > > >> > > > > inconsistencies or exceptions. > > >> > > > > > > >> > > > > 4) ZooKeeper version used in your testing ? > > >> > > > > > > >> > > > > > > >> > > > > Rakesh > > >> > > > > > > >> > > > > On Wed, Mar 1, 2017 at 4:55 PM, Tharindu Kumara < > > >> > > > zonik.hatkumara@gmail.com > > >> > > > > > > > >> > > > > wrote: > > >> > > > > > > >> > > > > > ​Recently, carried out a test to to find the behavior of > > clients > > >> > > when a > > >> > > > > > client is partitioned from the ensemble. > > >> > > > > > > > >> > > > > > Here I used a ensemble of 3 zookeeper servers called A, B > and > > C. > > >> > And > > >> > > > > quorum > > >> > > > > > was set up like below. > > >> > > > > > > > >> > > > > > A - Follower > > >> > > > > > B - Leader > > >> > > > > > C - Follower​ > > >> > > > > > > > >> > > > > > A <---> B <---> C > > >> > > > > > \____________/ > > >> > > > > > > > >> > > > > > And 3 clients are connected to ensemble like below. > > >> > > > > > > > >> > > > > > C1 is connected A > > >> > > > > > C2 is connected B > > >> > > > > > C3 is connected C. > > >> > > > > > > > >> > > > > > I used iptables to remove the network link between B and C. > > >> > > > > > > > >> > > > > > command used: iptables -I INPUT -s 123.123.45.123 -j DROP > > >> > > > > > > > >> > > > > > After removing the link connections looks like below. > > >> > > > > > > > >> > > > > > A <----> B C > > >> > > > > > \____________/ > > >> > > > > > > > >> > > > > > Simply there is no way to communicate from B to C and vice > > >> versa. > > >> > > > > > > > >> > > > > > Here What I noticed is that the client connected to > Zookeeper > > >> > Server > > >> > > > "C", > > >> > > > > > could not connect to the ensemble resulting a session > > expiration > > >> > > > timeout. > > >> > > > > > > > >> > > > > > For this experiment I used tickTime of 3000ms and client > > session > > >> > > > > expiration > > >> > > > > > timeout of 45000ms. And tested with different combinations > > also. > > >> > > > > > > > >> > > > > > Can someone please explain what is the root cause for this > > >> > behavior? > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > > > > > > > >