From: Anindya Haldar
Subject: Re: Questions on HA cluster and split brain
Date: Thu, 14 Jun 2018 15:39:57 -0700
References: <380E03C4-EFE3-44F4-B9A9-DEEC5264D386@oracle.com> <54F077F5-C759-495E-97DD-B7FFBA6E2ECB@oracle.com>
To: users@activemq.apache.org
Message-Id: <8420EB13-5A04-4A1D-850E-2DEDC2A7875F@oracle.com>

Thanks, again, for your quick response.

Anindya Haldar
Oracle Marketing Cloud

> On Jun 14, 2018, at 3:34 PM, Justin Bertram wrote:
>
>> 1) It is possible to define multiple groups within a cluster, and a
>> subset of the brokers in the cluster can be members of a specific group.
>> Is that correct?
>
> Yes.
>
>> 2) The live-backup relationship is guided by group membership, when there
>> is explicit group membership defined. Is that correct?
>
> Yes.
>
>> 3) When a backup or a live server in a group starts the quorum voting
>> process, other live servers in the cluster, even though they may not be
>> part of the same group, can participate in the quorum. Meaning the ability
>> to participate in quorum voting is defined by cluster membership, and not
>> by group membership within the cluster. Is that understanding correct?
>
> Yes.
>
>
> In short, a "group" allows the pairing of specific live and backup brokers
> together in the replicated HA use-case.
>
>
> Justin
>
>
> On Thu, Jun 14, 2018 at 5:19 PM, Anindya Haldar wrote:
>
>> I have a few quick follow up questions. From the discussion here, and from
>> what I understand reading the Artemis manual, here is my understanding
>> about the idea of a cluster vs. the idea of a group within a cluster:
>>
>> 1) It is possible to define multiple groups within a cluster, and a subset
>> of the brokers in the cluster can be members of a specific group. Is that
>> correct?
>>
>> 2) The live-backup relationship is guided by group membership, when there
>> is explicit group membership defined. Is that correct?
>>
>> 3) When a backup or a live server in a group starts the quorum voting
>> process, other live servers in the cluster, even though they may not be
>> part of the same group, can participate in the quorum. Meaning the ability
>> to participate in quorum voting is defined by cluster membership, and not
>> by group membership within the cluster. Is that understanding correct?
>>
>> Thanks,
>>
>> Anindya Haldar
>> Oracle Marketing Cloud
>>
>>
>>> On Jun 14, 2018, at 9:57 AM, Anindya Haldar wrote:
>>>
>>> Many thanks, Justin. This makes things much clearer for us when it comes
>>> to designing the HA cluster.
>>>
>>> As for the Artemis evaluation scope, we want to use it as one of the
>>> supported messaging backbones in our application suite. The application
>>> suite requires strong transactional guarantees, high availability, and high
>>> performance and scale, amongst other things. We are looking towards a
>>> full-blown technology evaluation with those needs in mind.
>>>
>>> Thanks,
>>>
>>> Anindya Haldar
>>> Oracle Marketing Cloud
>>>
>>>
>>>> On Jun 13, 2018, at 7:23 PM, Justin Bertram wrote:
>>>>
>>>>> Q1: At this point, will the transaction logs replicate from A to C?
>>>>
>>>> No. A will be replicating to B since B is the designated backup. Also, by
>>>> "transaction logs" I assume you mean what the Artemis documentation refers
>>>> to as the journal (i.e. all persistent message data).
>>>>
>>>>> Q2: At this point will C become the new backup for B, assuming A
>>>>> remains in failed state?
>>>>
>>>> Yes.
>>>>
>>>>> Q3: If the answer to Q2 is yes, B will start replicating its journals to
>>>>> C; is that correct?
>>>>
>>>> Yes.
>>>>
>>>>> Q4: At this point, which nodes are expected to participate in quorum voting?
>>>>> All of A, B and C? Or A and C only (B excludes itself from the
>>>>> set)? When it says "half the servers", I read it in a way that B includes
>>>>> itself in the quorum voting. Is that the case?
>>>>
>>>> A would be the only server available to participate in the quorum voting
>>>> since it is the only live server. However, since B can't reach A then B
>>>> would not receive any quorum vote responses. B doesn't vote; it simply
>>>> asks for a vote.
>>>>
>>>>> Q5: This implies only the live servers participate in quorum voting. Is
>>>>> that correct?
>>>>
>>>> Yes.
>>>>
>>>>> Q6: If the answer to Q5 is yes, then how does the split brain detection
>>>>> (as described in the quoted text right before Q4) work?
>>>>
>>>> It works by having multiple voting members (i.e. live servers) in the
>>>> cluster. The topology you've described with a single live and 2 backups is
>>>> not sufficient to mitigate against split brain.
>>>>
>>>>> Q7: The text implies that in order to avoid split brain, a cluster needs
>>>>> at least 3 live/backup PAIRS.
>>>>
>>>> That is correct - 3 live/backup pairs.
>>>>
>>>>> To me that implies at least 6 broker instances are needed in such a
>>>>> cluster; but that is kind of hard to believe, and I feel (I may be wrong)
>>>>> it actually means 3 broker instances, assuming scenarios 1 and 2 as
>>>>> described earlier are valid ones. Can you please clarify?
>>>>
>>>> What you feel is incorrect. That said, the live & backup instances can be
>>>> colocated which means although there are 6 total broker instances only 3
>>>> machines are required.
>>>>
>>>> I think implementing a feature whereby backups can participate in the
>>>> quorum vote would be a great addition to the broker. Unfortunately I
>>>> haven't had time to contribute such a feature.
>>>>
>>>>
>>>> If I may ask a question of my own... Your emails to this list have piqued my
>>>> interest and I'm curious to know to what end you are evaluating Artemis
>>>> since you apparently work for Oracle on a cloud related team and Oracle
>>>> already has a cloud messaging solution. Can you elaborate at all?
>>>>
>>>>
>>>> Justin
>>>>
>>>>
>>>> On Wed, Jun 13, 2018 at 7:56 PM, Anindya Haldar <anindya.haldar@oracle.com>
>>>> wrote:
>>>>
>>>>> BTW, these are questions related to Artemis 2.4.0, which is what we are
>>>>> evaluating right now for our solution.
>>>>>
>>>>>
>>>>>> On Jun 13, 2018, at 5:52 PM, Anindya Haldar <anindya.haldar@oracle.com>
>>>>>> wrote:
>>>>>>
>>>>>> I have some questions related to the HA cluster, failover and
>>>>>> split-brain cases.
>>>>>>
>>>>>> Suppose I have set up a 3 node cluster with:
>>>>>>
>>>>>> A = master
>>>>>> B = slave 1
>>>>>> C = slave 2
>>>>>>
>>>>>> Also suppose they are all part of the same group, and are set up to offer
>>>>>> replication based HA.
>>>>>>
>>>>>> Scenario 1
>>>>>> ========
>>>>>> Say,
>>>>>>
>>>>>> B starts up and finds A
>>>>>> B becomes the designated backup for A
>>>>>> C starts up, and tries to find a live server in this group
>>>>>> C figures that A already has a designated backup, which is B
>>>>>> C keeps waiting until the network topology is changed
>>>>>>
>>>>>>
>>>>>> Q1: At this point, will the transaction logs replicate from A to C?
>>>>>>
>>>>>> Now let’s say
>>>>>>
>>>>>> Node A (the current master) fails
>>>>>> B becomes the new master
>>>>>>
>>>>>> Q2: At this point will C become the new backup for B, assuming A
>>>>>> remains in failed state?
>>>>>>
>>>>>> Q3: If the answer to Q2 is yes, B will start replicating its journals to
>>>>>> C; is that correct?
>>>>>>
>>>>>>
>>>>>> Scenario 2 (split brain detection case)
>>>>>> =============================
>>>>>> Say,
>>>>>>
>>>>>> B detects a transient network failure with A
>>>>>> B wants to figure out if it needs to take over and be the new master
>>>>>> B starts a quorum voting process
>>>>>>
>>>>>> The manual says this in the ‘High Availability and Failover’ section:
>>>>>>
>>>>>> "Specifically, the backup will become active when it loses connection to
>>>>>> its live server. This can be problematic because this can also happen
>>>>>> because of a temporary network problem. In order to address this issue, the
>>>>>> backup will try to determine whether it still can connect to the other
>>>>>> servers in the cluster. If it can connect to more than half the servers, it
>>>>>> will become active, if more than half the servers also disappeared with the
>>>>>> live, the backup will wait and try reconnecting with the live. This avoids
>>>>>> a split brain situation."
>>>>>>
>>>>>> Q4: At this point, which nodes are expected to participate in quorum
>>>>>> voting? All of A, B and C? Or A and C only (B excludes itself from the
>>>>>> set)? When it says "half the servers", I read it in a way that B includes
>>>>>> itself in the quorum voting. Is that the case?
>>>>>>
>>>>>> Whereas in the ‘Avoiding Network Isolation’ section, the manual says
>>>>>> this:
>>>>>>
>>>>>> "Quorum voting is used by both the live and the backup to decide what to
>>>>>> do if a replication connection is disconnected. Basically the server will
>>>>>> request each live server in the cluster to vote as to whether it thinks the
>>>>>> server it is replicating to or from is still alive. This being the case the
>>>>>> minimum number of live/backup pairs needed is 3."
>>>>>>
>>>>>> Q5: This implies only the live servers participate in quorum voting. Is
>>>>>> that correct?
>>>>>>
>>>>>> Q6: If the answer to Q5 is yes, then how does the split brain detection
>>>>>> (as described in the quoted text right before Q4) work?
>>>>>>
>>>>>> Q7: The text implies that in order to avoid split brain, a cluster needs
>>>>>> at least 3 live/backup PAIRS. To me that implies at least 6 broker
>>>>>> instances are needed in such a cluster; but that is kind of hard to
>>>>>> believe, and I feel (I may be wrong) it actually means 3 broker instances,
>>>>>> assuming scenarios 1 and 2 as described earlier are valid ones. Can you
>>>>>> please clarify?
>>>>>>
>>>>>> Would appreciate if someone can offer clarity on these questions.
>>>>>>
>>>>>> Thanks,
>>>>>> Anindya Haldar
>>>>>> Oracle Marketing Cloud
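
For anyone following the quorum discussion in this thread, here is a minimal sketch of the "more than half the servers" rule quoted from the 'High Availability and Failover' section. It also reflects two points made in the replies: only live servers vote, and the backup requesting the vote does not vote itself. This is a toy model and not Artemis code. The class QuorumSketch, the Outcome enum and the decide method are hypothetical names made up for illustration. The sketch deliberately does not say what the broker does when there are no voters at all, because the thread only states that no vote responses would arrive in that case.

```java
import java.util.Map;

public class QuorumSketch {

    // Outcomes of this toy model (hypothetical names, not Artemis types).
    enum Outcome { BECOME_ACTIVE, WAIT_AND_RETRY, NO_VOTERS }

    /**
     * Toy decision for a backup that has lost its own live server.
     *
     * @param voterReachable one entry per OTHER live server in the cluster,
     *                       true if the backup can still reach it
     */
    static Outcome decide(Map<String, Boolean> voterReachable) {
        if (voterReachable.isEmpty()) {
            // Nobody to ask, so the vote offers no split-brain protection.
            return Outcome.NO_VOTERS;
        }
        long reachable = voterReachable.values().stream()
                .filter(Boolean::booleanValue)
                .count();
        // "More than half the servers" from the manual excerpt.
        return reachable * 2 > voterReachable.size()
                ? Outcome.BECOME_ACTIVE
                : Outcome.WAIT_AND_RETRY;
    }

    public static void main(String[] args) {
        // One live (A) with backups B and C: once A is unreachable,
        // there is no other live server left to vote.
        System.out.println(decide(Map.of()));
        // Three live/backup pairs: the backup of pair 1 asks the two other lives.
        System.out.println(decide(Map.of("live2", true, "live3", true)));
        System.out.println(decide(Map.of("live2", true, "live3", false)));
    }
}
```

In this model the single-live topology from Scenario 2 always ends in NO_VOTERS, which matches the reply that one live with two backups is not sufficient. With three live/backup pairs, a backup that still reaches both of the other lives gets a majority.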