MIME-Version: 1.0
References: <CECBE008-A965-4A44-974D-912F4988A1CA@oracle.com>
 <380E03C4-EFE3-44F4-B9A9-DEEC5264D386@oracle.com> <CAF+kE=RmfrYX9v_RrUA4WwZjDf0cWv-XnxSfEXXmY05qfnwDdw@mail.gmail.com>
 <54F077F5-C759-495E-97DD-B7FFBA6E2ECB@oracle.com> <F580D4DF-F29F-4EE7-AE8A-C1F69D505809@oracle.com>
 <CAF+kE=Ts7RbWbz+uNk3-r-wRdpQ8kY4XkurEdogGUEV6u5mB_Q@mail.gmail.com> <8420EB13-5A04-4A1D-850E-2DEDC2A7875F@oracle.com>
In-Reply-To: <8420EB13-5A04-4A1D-850E-2DEDC2A7875F@oracle.com>
From: Clebert Suconic <clebert.suconic@gmail.com>
Date: Thu, 14 Jun 2018 19:48:09 -0400
Message-ID: <CAKF+bsosgY4Ljh4JsKAK03bj659S6csPrSiDqe0ZqbpPWXGS2w@mail.gmail.com>
Subject: Re: Questions on HA cluster and split brain
To: users@activemq.apache.org
Content-Type: multipart/alternative; boundary="000000000000ee7031056ea2bb7c"

--000000000000ee7031056ea2bb7c
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

I think you should use 2.6.1.  Everything discussed here is equivalent in
that release, and it is where we are actively fixing issues now.
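
As a rough illustration of the quorum rule described in the quoted thread
below (only live servers vote; the backup merely asks for the vote), here is
a minimal Python sketch. It is not Artemis code: the function name, its
parameters and the three outcome strings are hypothetical, and measuring
"more than half" against all live servers is an assumption taken from the
manual wording quoted in the thread.

```python
from typing import Iterable, Set


def quorum_outcome(live_servers: Iterable[str],
                   lost_live: str,
                   reachable: Set[str]) -> str:
    """Simplified model of a backup asking for a quorum vote.

    live_servers: names of every live server in the cluster topology
    lost_live:    the live server the backup has lost its connection to
    reachable:    live servers the backup can still reach

    The backup itself never votes; it only asks the live servers.
    """
    lives = list(live_servers)
    # Only the *other* live servers can answer the vote request.
    voters = [s for s in lives if s != lost_live]
    if not voters:
        # A single live with any number of backups: nobody can vote, so a
        # vote cannot tell a dead live from a network partition.
        return "no-voters"
    votes = sum(1 for s in voters if s in reachable)
    # Assumption: "more than half" is measured against all live servers.
    return "activate" if votes > len(lives) / 2 else "wait"


# Topology from the thread: A is live, B and C are backups.
print(quorum_outcome(["A"], "A", set()))
# Three live/backup pairs; the backup of A1 can still reach A2 and A3.
print(quorum_outcome(["A1", "A2", "A3"], "A1", {"A2", "A3"}))
# Same cluster, but this time the backup itself is cut off.
print(quorum_outcome(["A1", "A2", "A3"], "A1", set()))
```

The first call shows why a single live with two backups cannot guard against
split brain: there is no other live server to answer the vote request.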

On Thu, Jun 14, 2018 at 6:40 PM Anindya Haldar <anindya.haldar@oracle.com>
wrote:

> Thanks, again, for your quick response.
>
> Anindya Haldar
> Oracle Marketing Cloud
>
>
> > On Jun 14, 2018, at 3:34 PM, Justin Bertram <jbertram@apache.org> wrote:
> >
> >> 1) It is possible to define multiple groups within a cluster, and a
> >> subset of the brokers in the cluster can be members of a specific group.
> >> Is that correct?
> >
> > Yes.
> >
> >> 2) The live-backup relationship is guided by group membership, when
> >> there is explicit group membership defined. Is that correct?
> >
> > Yes.
> >
> >> 3) When a backup or a live server in a group starts the quorum voting
> >> process, other live servers in the cluster, even though they may not be
> >> part of the same group, can participate in the quorum. Meaning the
> >> ability to participate in quorum voting is defined by cluster membership,
> >> and not by group membership within the cluster. Is that understanding
> >> correct?
> >
> > Yes.
> >
> >
> > In short, a "group" allows the pairing of specific live and backup
> > brokers together in the replicated HA use-case.
> >
> >
> > Justin
> >
> >
> > On Thu, Jun 14, 2018 at 5:19 PM, Anindya Haldar
> > <anindya.haldar@oracle.com> wrote:
> >
> >> I have a few quick follow-up questions. From the discussion here, and
> >> from what I understand reading the Artemis manual, here is my
> >> understanding about the idea of a cluster vs. the idea of a group within
> >> a cluster:
> >>
> >> 1) It is possible to define multiple groups within a cluster, and a
> >> subset of the brokers in the cluster can be members of a specific group.
> >> Is that correct?
> >>
> >> 2) The live-backup relationship is guided by group membership, when
> >> there is explicit group membership defined. Is that correct?
> >>
> >> 3) When a backup or a live server in a group starts the quorum voting
> >> process, other live servers in the cluster, even though they may not be
> >> part of the same group, can participate in the quorum. Meaning the
> >> ability to participate in quorum voting is defined by cluster membership,
> >> and not by group membership within the cluster. Is that understanding
> >> correct?
> >>
> >> Thanks,
> >>
> >> Anindya Haldar
> >> Oracle Marketing Cloud
> >>
> >>
> >>> On Jun 14, 2018, at 9:57 AM, Anindya Haldar <anindya.haldar@oracle.com>
> >>> wrote:
> >>>
> >>> Many thanks, Justin. This makes things much clearer for us when it
> >>> comes to designing the HA cluster.
> >>>
> >>> As for the Artemis evaluation scope, we want to use it as one of the
> >>> supported messaging backbones in our application suite. The application
> >>> suite requires strong transactional guarantees, high availability, and
> >>> high performance and scale, amongst other things. We are looking towards
> >>> a full-blown technology evaluation with those needs in mind.
> >>>
> >>> Thanks,
> >>>
> >>> Anindya Haldar
> >>> Oracle Marketing Cloud
> >>>
> >>>
> >>>> On Jun 13, 2018, at 7:23 PM, Justin Bertram <jbertram@apache.org>
> >>>> wrote:
> >>>>
> >>>>> Q1: At this point, will the transaction logs replicate from A to C?
> >>>>
> >>>> No.  A will be replicating to B since B is the designated backup.
> >>>> Also, by "transaction logs" I assume you mean what the Artemis
> >>>> documentation refers to as the journal (i.e. all persistent message
> >>>> data).
> >>>>
> >>>>> Q2: At this point will C become the new backup for B, assuming A
> >>>>> remains in failed state?
> >>>>
> >>>> Yes.
> >>>>
> >>>>> Q3: If the answer to Q2 is yes, B will start replicating its journals
> >>>>> to C; is that correct?
> >>>>
> >>>> Yes.
> >>>>
> >>>>> Q4: At this point, which nodes are expected to participate in quorum
> >>>>> voting? All of A, B and C? Or A and C only (B excludes itself from the
> >>>>> set)? When it says "half the servers", I read it in a way that B
> >>>>> includes itself in the quorum voting. Is that the case?
> >>>>
> >>>> A would be the only server available to participate in the quorum
> >>>> voting since it is the only live server.  However, since B can't reach
> >>>> A then B would not receive any quorum vote responses.  B doesn't vote;
> >>>> it simply asks for a vote.
> >>>>
> >>>>> Q5: This implies only the live servers participate in quorum voting.
> >>>>> Is that correct?
> >>>>
> >>>> Yes.
> >>>>
> >>>>> Q6: If the answer to Q5 is yes, then how does the split brain
> >>>>> detection (as described in the quoted text right before Q4) work?
> >>>>
> >>>> It works by having multiple voting members (i.e. live servers) in the
> >>>> cluster.  The topology you've described with a single live and 2
> >>>> backups is not sufficient to mitigate against split brain.
> >>>>
> >>>>> Q7: The text implies that in order to avoid split brain, a cluster
> >>>>> needs at least 3 live/backup PAIRS.
> >>>>
> >>>> That is correct - 3 live/backup pairs.
> >>>>
> >>>>> To me that implies at least 6 broker instances are needed in such a
> >>>>> cluster; but that is kind of hard to believe, and I feel (I may be
> >>>>> wrong) it actually means 3 broker instances, assuming scenarios 1 and
> >>>>> 2 as described earlier are valid ones. Can you please clarify?
> >>>>
> >>>> What you feel is incorrect.  That said, the live & backup instances
> >>>> can be colocated which means although there are 6 total broker
> >>>> instances only 3 machines are required.
> >>>>
> >>>> I think implementing a feature whereby backups can participate in the
> >>>> quorum vote would be a great addition to the broker.  Unfortunately I
> >>>> haven't had time to contribute such a feature.
> >>>>
> >>>>
> >>>> If I may ask a question of my own...Your emails to this list have
> >>>> piqued my interest and I'm curious to know to what end you are
> >>>> evaluating Artemis since you apparently work for Oracle on a cloud
> >>>> related team and Oracle already has a cloud messaging solution.  Can
> >>>> you elaborate at all?
> >>>>
> >>>>
> >>>> Justin
> >>>>
> >>>>
> >>>> On Wed, Jun 13, 2018 at 7:56 PM, Anindya Haldar
> >>>> <anindya.haldar@oracle.com> wrote:
> >>>>
> >>>>> BTW, these are questions related to Artemis 2.4.0, which is what we
> >>>>> are evaluating right now for our solution.
> >>>>>
> >>>>>
> >>>>>> On Jun 13, 2018, at 5:52 PM, Anindya Haldar
> >>>>>> <anindya.haldar@oracle.com> wrote:
> >>>>>>
> >>>>>> I have some questions related to the HA cluster, failover and
> >>>>>> split-brain cases.
> >>>>>>
> >>>>>> Suppose I have set up a 3 node cluster with:
> >>>>>>
> >>>>>> A = master
> >>>>>> B = slave 1
> >>>>>> C = slave 2
> >>>>>>
> >>>>>> Also suppose they are all part of the same group, and are set up to
> >>>>>> offer replication based HA.
> >>>>>>
> >>>>>> Scenario 1
> >>>>>> ========
> >>>>>> Say,
> >>>>>>
> >>>>>> B starts up and finds A
> >>>>>> B becomes the designated backup for A
> >>>>>> C starts up, and tries to find a live server in this group
> >>>>>> C figures that A already has a designated backup, which is B
> >>>>>> C keeps waiting until the network topology is changed
> >>>>>>
> >>>>>>
> >>>>>> Q1: At this point, will the transaction logs replicate from A to C?
> >>>>>>
> >>>>>> Now let’s say
> >>>>>>
> >>>>>> Node A (the current master) fails
> >>>>>> B becomes the new master
> >>>>>>
> >>>>>> Q2: At this point will C become the new backup for B, assuming A
> >>>>>> remains in failed state?
> >>>>>>
> >>>>>> Q3: If the answer to Q2 is yes, B will start replicating its
> >>>>>> journals to C; is that correct?
> >>>>>>
> >>>>>>
> >>>>>> Scenario 2 (split brain detection case)
> >>>>>> =============================
> >>>>>> Say,
> >>>>>>
> >>>>>> B detects a transient network failure with A
> >>>>>> B wants to figure out if it needs to take over and be the new master
> >>>>>> B starts a quorum voting process
> >>>>>>
> >>>>>> The manual says this in the ‘High Availability and Failover’ section:
> >>>>>>
> >>>>>> "Specifically, the backup will become active when it loses
> >>>>>> connection to its live server. This can be problematic because this
> >>>>>> can also happen because of a temporary network problem. In order to
> >>>>>> address this issue, the backup will try to determine whether it still
> >>>>>> can connect to the other servers in the cluster. If it can connect to
> >>>>>> more than half the servers, it will become active, if more than half
> >>>>>> the servers also disappeared with the live, the backup will wait and
> >>>>>> try reconnecting with the live. This avoids a split brain situation."
> >>>>>>
> >>>>>> Q4: At this point, which nodes are expected to participate in quorum
> >>>>>> voting? All of A, B and C? Or A and C only (B excludes itself from
> >>>>>> the set)? When it says "half the servers", I read it in a way that B
> >>>>>> includes itself in the quorum voting. Is that the case?
> >>>>>>
> >>>>>> Whereas in the ‘Avoiding Network Isolation’ section, the manual says
> >>>>>> this:
> >>>>>>
> >>>>>> "Quorum voting is used by both the live and the backup to decide what
> >>>>>> to do if a replication connection is disconnected. Basically the
> >>>>>> server will request each live server in the cluster to vote as to
> >>>>>> whether it thinks the server it is replicating to or from is still
> >>>>>> alive. This being the case the minimum number of live/backup pairs
> >>>>>> needed is 3."
> >>>>>>
> >>>>>> Q5: This implies only the live servers participate in quorum voting.
> >>>>>> Is that correct?
> >>>>>>
> >>>>>> Q6: If the answer to Q5 is yes, then how does the split brain
> >>>>>> detection (as described in the quoted text right before Q4) work?
> >>>>>>
> >>>>>> Q7: The text implies that in order to avoid split brain, a cluster
> >>>>>> needs at least 3 live/backup PAIRS. To me that implies at least 6
> >>>>>> broker instances are needed in such a cluster; but that is kind of
> >>>>>> hard to believe, and I feel (I may be wrong) it actually means 3
> >>>>>> broker instances, assuming scenarios 1 and 2 as described earlier are
> >>>>>> valid ones. Can you please clarify?
> >>>>>>
> >>>>>> Would appreciate if someone can offer clarity on these questions.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Anindya Haldar
> >>>>>> Oracle Marketing Cloud
> >>>>>>
> >>>>>
> >>>>>
> >>>
> >>
> >>
>
> --
Clebert Suconic

--000000000000ee7031056ea2bb7c--