From issues-return-883-archive-asf-public=cust-asf.ponee.io@zookeeper.apache.org Wed Sep 4 23:41:03 2019
From: "Michael Han (Jira)"
To: issues@zookeeper.apache.org
Date: Wed, 4 Sep 2019 23:41:00 +0000 (UTC)
Subject: [jira] [Commented] (ZOOKEEPER-3534) Non-stop communication between participants and observers.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922942#comment-16922942 ]

Michael Han commented on ZOOKEEPER-3534:
----------------------------------------

Yes, the problematic code path in this case is inside Messenger's sender and receiver, which do not use any of the min/max interval timeouts (the value is hardcoded as 3000 milliseconds at the moment, which seems like a place to improve by making it configurable in a separate jira).

Even if we make the sender/receiver queue poll timeout configurable for Messenger, I am not sure that will solve the issue. Once we run into such a communication state, the queues will always be non-empty, so the timeout does not matter much here - until the logic bails out by itself.

Avoiding getting into such a state in the first place (proposed fix 1) seems like a good approach.

> Non-stop communication between participants and observers.
> ----------------------------------------------------------
>
>                 Key: ZOOKEEPER-3534
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3534
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Karolos Antoniadis
>            Priority: Minor
>         Attachments: create_np_case_3.sh
>
>
> Hello ZooKeeper developers,
> there are cases during *leader election* where there is non-stop communication between observers and participants.
> This communication occurs as follows:
> - an observer sends a notification to a participant
> - the participant responds
> - the observer sends another notification, and so on and so forth...
> It is possible that an observer-participant pair exchanges hundreds of notification messages in the span of one second.
> As a consequence, the system is burdened with unnecessary load, and the logs are filled with useless information, as can be seen below:
>
> {noformat}
> 2019-09-03 16:37:22,630 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000
> 2019-09-03 16:37:22,632 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000
> 2019-09-03 16:37:22,633 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000
> 2019-09-03 16:37:22,635 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000
> 2019-09-03 16:37:22,635 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000{noformat}
>
>
> h4. Why does the non-stop communication bug occur?
> This bug stems from the fact that when a participant receives a notification from an observer, the participant responds right away, as can be seen [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L325] - it is even mentioned in the comments.
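The reply conditions on both sides can be condensed into two small predicates (a toy model with hypothetical names for illustration, not ZooKeeper's actual classes): a LOOKING participant answers every notification it receives, and in the two observer-side cases described in this issue the observer answers every reply in turn, so each message triggers another one and the exchange never settles on its own.

```java
// Toy model of the notification ping-pong (hypothetical helper names,
// not ZooKeeper's real classes).
public class NotificationLoop {

    enum State { LOOKING, FOLLOWING, OBSERVING }

    // A participant in LOOKING state answers every notification it
    // receives right away (the "respond right away" path).
    static boolean participantReplies(State participantState) {
        return participantState == State.LOOKING;
    }

    // Case 1: both sides LOOKING and the observer's logicalclock is ahead.
    // Case 2: the observer is already OBSERVING while the participant is
    // still LOOKING. In either case the observer answers the reply.
    static boolean observerReplies(State observerState, State participantState,
                                   long observerClock, long participantClock) {
        boolean case1 = observerState == State.LOOKING
                && participantState == State.LOOKING
                && observerClock > participantClock;
        boolean case2 = observerState == State.OBSERVING
                && participantState == State.LOOKING;
        return case1 || case2;
    }

    public static void main(String[] args) {
        // Case 1: the observer restarted its election round, so its clock
        // is ahead; both predicates hold, so every message draws a reply.
        boolean loops = participantReplies(State.LOOKING)
                && observerReplies(State.LOOKING, State.LOOKING, 2, 1);
        System.out.println(loops); // prints true: the exchange cycles forever
    }
}
```

Because both predicates stay true for as long as neither side changes state or round, nothing in the exchange itself ever breaks the cycle, which matches the hundreds of messages per second observed above.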
> Now, when the observer receives back the message from the participant, there are 2 cases that could lead to non-stop communication:
> 1) The observer has a greater {{logicalclock}} than the participant, and both the observer and the participant are in a {{LOOKING}} state. In such a case, the observer responds right away to the participant, as can be seen [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L392].
> 2) The observer is {{OBSERVING}} while the participant is still {{LOOKING}}; the non-stop communication then ensues due to the code in [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L413].
>
> h4. How can we reproduce this non-stop communication bug?
> It is not trivial to reproduce this bug, although we have seen it occurring in the wild. To reproduce it, we provide a script that utilizes Docker and can be used to easily debug ZK code. The script starts a ZK cluster with 3 participants (P1, P2, P3) and 2 observers (O1, O2). The script, together with instructions on how to use it, can be found [here|https://github.com/insumity/zookeeper_debug_tool].
>
> Using the script, there are at least 2 ways to reproduce the bug:
> 1) We can artificially delay the leader election by introducing the following code in {{FastLeaderElection}} (in [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L1006]).
>
> {code:java}
> // Verify if there is any change in the proposed leader
> int time = finalizeWait;
> if (self.getId() >= 1 && self.getId() <= 3) {
>     time = 2000;
> }{code}
> and changing the immediately succeeding line:
> {code:java}
> while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {code}
> to
> {code:java}
> while ((n = recvqueue.poll(time, TimeUnit.MILLISECONDS)) != null) {code}
> Now, if we run a ZK cluster and force a leader election by killing the leader, we see the non-stop communication occurring. The reason is that, as a result of this delay, the observer restarts (increments its {{logicalclock}}), tries to connect to the previous leader, but fails since the previous leader has crashed; the observer then restarts by incrementing {{logicalclock}} once more, hence starting the non-stop communication.
> 2) Another way to reproduce the bug is by creating a network partition that separates P1 from P2, P3, and O2, but still keeps participant P1 connected to observer O1. In such a case, the non-stop communication ensues since O1 is {{OBSERVING}} while P1 remains in a {{LOOKING}} state. To reproduce this, using the above script, one just has to:
> * wait until the ZK cluster starts running
> * on your local machine, run ./create_np_case_3.sh (file attached to this issue)
> * force a leader election by restarting the leader (most likely the leader is server 3)
> It is true that scenario 2 is slightly unrealistic. However, the first scenario, where leader election takes too much time to complete, is pretty realistic. Whenever we saw this non-stop communication bug, it was because leader election took too long to complete.
> For instance, it could occur if there is some type of split vote during LE and the elected leader times out while
> {noformat}
> waiting for epoch from quorum {noformat}
> [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L1350].
>
> h4. How can we fix this issue?
> One idea would be that, before an observer starts observing a leader, it verifies that the leader is up and running, using a check similar to {{checkLeader}}, as is done [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L1037].
> This would prevent non-stop communication between observers and participants during long leader elections, since observers would not try to connect to an already failed leader and hence would not increase their {{logicalclock}}. However, this fix on its own does not solve the 2nd way of reproducing the bug described above.
> Best Regards,
> Karolos
>

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
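The liveness check behind proposed fix 1 can be sketched as follows. This is a minimal, hypothetical sketch using a raw TCP probe; the actual patch would reuse FastLeaderElection's existing {{checkLeader}} and connection machinery rather than open its own socket. The idea is only that the observer confirms the proposed leader is reachable before switching to OBSERVING, so a dead leader does not cause a restarted round and a bumped {{logicalclock}}.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch of proposed fix 1 (hypothetical helper, not the actual patch):
// before an observer starts observing a proposed leader, check that the
// leader is reachable; if not, keep looking without restarting the round,
// so logicalclock is not incremented and the ping-pong never begins.
public class LeaderLivenessCheck {

    // Returns true when a TCP connection to the proposed leader succeeds
    // within the given timeout; false for a crashed or partitioned leader.
    static boolean leaderIsAlive(InetSocketAddress leaderAddr, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(leaderAddr, timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Nothing listens on this port, so the probe reports the "leader"
        // as dead and the observer would keep its current logicalclock.
        InetSocketAddress dead = new InetSocketAddress("127.0.0.1", 1);
        System.out.println(leaderIsAlive(dead, 200)); // prints false
    }
}
```

Note that this only addresses reproduction scenario 1 (long elections against a crashed leader); as the issue points out, the partition scenario would still need separate handling.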