From issues-return-883-archive-asf-public=cust-asf.ponee.io@zookeeper.apache.org Wed Sep 4 23:41:03 2019
From: "Michael Han (Jira)"
To: issues@zookeeper.apache.org
Date: Wed, 4 Sep 2019 23:41:00 +0000 (UTC)
Subject: [jira] [Commented] (ZOOKEEPER-3534) Non-stop communication between participants and observers.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922942#comment-16922942 ]

Michael Han commented on ZOOKEEPER-3534:
----------------------------------------

Yes, the problematic code path in this case is inside Messenger's sender and receiver, which do not use any of the min/max interval timeouts (the value is hardcoded as 3000 milliseconds at the moment, which seems like a place to improve by making it configurable in a separate jira).

Even if we make the sender/receiver queue poll timeout configurable for Messenger, I am not sure that will solve the issue. Once we run into such a communication state, the queues will always be non-empty, so the timeout does not matter much here - until the logic bails out by itself.

Avoiding getting into such a state in the first place (proposed fix 1) seems like a good approach.

> Non-stop communication between participants and observers.
> ----------------------------------------------------------
>
>                 Key: ZOOKEEPER-3534
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3534
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Karolos Antoniadis
>            Priority: Minor
>         Attachments: create_np_case_3.sh
>
>
> Hello ZooKeeper developers,
> there are cases during *leader election* where there is non-stop communication between observers and participants.
> This communication occurs as follows:
> - an observer sends a notification to a participant
> - the participant responds
> - the observer sends another notification, and so on and so forth...
> It is possible that an observer-participant pair exchanges hundreds of notification messages in the span of one second.
> As a consequence, the system is burdened with unnecessary load, and the logs are filled with useless information, as can be seen below:
>
> {noformat}
> 2019-09-03 16:37:22,630 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000
> 2019-09-03 16:37:22,632 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000
> 2019-09-03 16:37:22,633 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000
> 2019-09-03 16:37:22,635 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000
> 2019-09-03 16:37:22,635 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection@692] - Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:3, n.round:0x2, n.peerEpoch:0x1, n.zxid:0x0, message format version:0x2, n.config version:0x100000000{noformat}
>
>
> h4. Why does the non-stop communication bug occur?
> This bug stems from the fact that when a participant receives a notification from an observer, the participant responds right away, as can be seen [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L325] - it is even mentioned in the comments.
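The reply conditions on both sides can be condensed into two small predicates (a toy model with hypothetical names for illustration, not ZooKeeper's actual classes): a LOOKING participant answers every notification it receives, and in the two observer-side cases described in this issue the observer answers every reply in turn, so each message triggers another one and the exchange never settles on its own.

```java
// Toy model of the notification ping-pong (hypothetical helper names,
// not ZooKeeper's real classes).
public class NotificationLoop {

    enum State { LOOKING, FOLLOWING, OBSERVING }

    // A participant in LOOKING state answers every notification it
    // receives right away (the "respond right away" path).
    static boolean participantReplies(State participantState) {
        return participantState == State.LOOKING;
    }

    // Case 1: both sides LOOKING and the observer's logicalclock is ahead.
    // Case 2: the observer is already OBSERVING while the participant is
    // still LOOKING. In either case the observer answers the reply.
    static boolean observerReplies(State observerState, State participantState,
                                   long observerClock, long participantClock) {
        boolean case1 = observerState == State.LOOKING
                && participantState == State.LOOKING
                && observerClock > participantClock;
        boolean case2 = observerState == State.OBSERVING
                && participantState == State.LOOKING;
        return case1 || case2;
    }

    public static void main(String[] args) {
        // Case 1: the observer restarted its election round, so its clock
        // is ahead; both predicates hold, so every message draws a reply.
        boolean loops = participantReplies(State.LOOKING)
                && observerReplies(State.LOOKING, State.LOOKING, 2, 1);
        System.out.println(loops); // prints true: the exchange cycles forever
    }
}
```

Because both predicates stay true for as long as neither side changes state or round, nothing in the exchange itself ever breaks the cycle, which matches the hundreds of messages per second observed above.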
> Now, when the observer receives back the message from the participant, there are 2 cases that could lead to non-stop communication:
> 1) The observer has a greater {{logicalclock}} than the participant, and both the observer and the participant are in a {{LOOKING}} state. In such a case, the observer responds right away to the participant, as can be seen [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L392].
> 2) The observer is {{OBSERVING}} while the participant is still {{LOOKING}}; the non-stop communication then ensues due to the code in [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L413].
>
> h4. How can we reproduce this non-stop communication bug?
> It is not trivial to reproduce this bug, although we have seen it occurring in the wild. To reproduce it, we provide a script that utilizes Docker and can be used to easily debug ZK code. The script starts a ZK cluster with 3 participants (P1, P2, P3) and 2 observers (O1, O2). The script, together with instructions on how to use it, can be found [here|https://github.com/insumity/zookeeper_debug_tool].
>
> Using the script, there are at least 2 ways to reproduce the bug:
> 1) We can artificially delay the leader election by introducing the following code in {{FastLeaderElection}} (in [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L1006]).
>
> {code:java}
> // Verify if there is any change in the proposed leader
> int time = finalizeWait;
> if (self.getId() >= 1 && self.getId() <= 3) {
>     time = 2000;
> }{code}
> and changing the immediately succeeding line:
> {code:java}
> while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {code}
> to
> {code:java}
> while ((n = recvqueue.poll(time, TimeUnit.MILLISECONDS)) != null) {code}
> Now, if we run a ZK cluster and force a leader election by killing the leader, we see the non-stop communication occurring. The reason is that, as a result of this delay, the observer restarts (increments its {{logicalclock}}), tries to connect to the previous leader, but fails since the previous leader has crashed; the observer then restarts by incrementing {{logicalclock}} once more, hence starting the non-stop communication.
> 2) Another way to reproduce the bug is by creating a network partition that separates P1 from P2, P3, and O2, but still keeps participant P1 connected to observer O1. In such a case, the non-stop communication ensues since O1 is {{OBSERVING}} while P1 remains in a {{LOOKING}} state. To reproduce this, using the above script, one just has to:
> * wait until the ZK cluster starts running
> * on your local machine, run ./create_np_case_3.sh (file attached to this issue)
> * force a leader election by restarting the leader (most likely the leader is server 3)
> It is true that scenario 2 is slightly unrealistic. However, the first scenario, where leader election takes too much time to complete, is pretty realistic. Whenever we saw this non-stop communication bug, it was because leader election took too long to complete.
> For instance, it could occur if there is some type of split vote during LE and the elected leader times out while
> {noformat}
> waiting for epoch from quorum {noformat}
> [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L1350].
>
> h4. How can we fix this issue?
> One idea would be that, before an observer starts observing a leader, it verifies that the leader is up and running, using a check similar to {{checkLeader}}, as is done [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L1037].
> This would prevent non-stop communication between observers and participants during long leader elections, since observers would not try to connect to an already failed leader and hence would not increase their {{logicalclock}}. However, this fix on its own does not solve the 2nd way of reproducing the bug described above.
> Best Regards,
> Karolos
>

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
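The liveness check behind proposed fix 1 can be sketched as follows. This is a minimal, hypothetical sketch using a raw TCP probe; the actual patch would reuse FastLeaderElection's existing {{checkLeader}} and connection machinery rather than open its own socket. The idea is only that the observer confirms the proposed leader is reachable before switching to OBSERVING, so a dead leader does not cause a restarted round and a bumped {{logicalclock}}.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch of proposed fix 1 (hypothetical helper, not the actual patch):
// before an observer starts observing a proposed leader, check that the
// leader is reachable; if not, keep looking without restarting the round,
// so logicalclock is not incremented and the ping-pong never begins.
public class LeaderLivenessCheck {

    // Returns true when a TCP connection to the proposed leader succeeds
    // within the given timeout; false for a crashed or partitioned leader.
    static boolean leaderIsAlive(InetSocketAddress leaderAddr, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(leaderAddr, timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Nothing listens on this port, so the probe reports the "leader"
        // as dead and the observer would keep its current logicalclock.
        InetSocketAddress dead = new InetSocketAddress("127.0.0.1", 1);
        System.out.println(leaderIsAlive(dead, 200)); // prints false
    }
}
```

Note that this only addresses reproduction scenario 1 (long elections against a crashed leader); as the issue points out, the partition scenario would still need separate handling.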