activemq-issues mailing list archives

From "Catalin Alexandru Zamfir (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (ARTEMIS-1285) Standby slave would not announce replication to master when the slave is down
Date Wed, 18 Apr 2018 09:21:00 GMT

    [ https://issues.apache.org/jira/browse/ARTEMIS-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442174#comment-16442174 ]

Catalin Alexandru Zamfir edited comment on ARTEMIS-1285 at 4/18/18 9:20 AM:
----------------------------------------------------------------------------

Ok, so we have 2 issues here:
 * on the 2nd standby backup, if you have the Hawtio console open, it keeps complaining with
AMQ222040: Server is stopped in the logs, spamming them for as long as you keep the browser
tab open;
 ** the message was misleading and is what eventually pointed me to this ticket;
 * we've tested all possible failure scenarios (a configuration sketch of the three-broker
group follows this list);
 ** the one in the examples works: when the master fails first, the 1st backup becomes live
and the 2nd backup becomes the active backup to the current live (the former 1st backup);
 ** *if, however, the master is live but the 1st backup fails*, which is the issue here in
ARTEMIS-1285, the 2nd-ary slave doesn't take over as active backup. It doesn't vote and doesn't
detect that the 1st backup failed; it just sits and waits;
 *** if we restart the 1st backup, restoring "the link", failover then occurs as in the examples;

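For reference only, a minimal broker.xml sketch of the three-broker replication group being
tested here (one master and two slaves sharing the same group-name; the group name is made up
and everything outside the ha-policy section is omitted):
{code:xml}
<!-- master (live) broker.xml: ha-policy section only -->
<ha-policy>
   <replication>
      <master>
         <group-name>group-1</group-name>
         <check-for-live-server>true</check-for-live-server>
      </master>
   </replication>
</ha-policy>

<!-- 1st and 2nd backup broker.xml: both declare the same group -->
<ha-policy>
   <replication>
      <slave>
         <group-name>group-1</group-name>
         <allow-failback>true</allow-failback>
      </slave>
   </replication>
</ha-policy>
{code}
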
Fact is, real-life failures are unpredictable, and there should be a "competition" between
the backups (eg. ZK-style leader election among the backup instances) for the one single "live",
so that if your 1st backup fails but your master has not yet failed, your 2nd-ary backup gets
in sync with your master. This would improve the fault tolerance of the cluster as a whole.
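
To illustrate the kind of "competition" I mean (this is not how Artemis works today), here is
a minimal Java sketch using Apache Curator's LeaderLatch: every backup of the group races for
the same latch, and only the winner would go on to announce replication to the live. The ZooKeeper
ensemble, latch path and backup ids are all made up for the example.
{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class BackupElectionSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical: each backup runs this with its own id, e.g. "backup-1" or "backup-2".
        String backupId = args.length > 0 ? args[0] : "backup-1";

        // Made-up ZooKeeper ensemble address.
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        // All backups of the same group compete on the same latch path.
        LeaderLatch latch = new LeaderLatch(zk, "/artemis/group-1/backup-leader", backupId);
        latch.start();
        latch.await(); // blocks until this backup wins the election

        // Only the winner would announce replication to the current live broker;
        // the losing backup keeps waiting and takes over if the winner dies.
        System.out.println(backupId + " won the election, would announce replication to the live");
    }
}
{code}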

I agree with Denis here: either we document this specific situation (ARTEMIS-1285), so that
people are aware the 2nd-ary backup only kicks in when this specific failure sequence happens
(master first, then 1st backup, then 2nd backup), as in the examples; or the logic promotes
some competition between the backups (by voting on who wins the ability to become backup for
the given live in the group).



> Standby slave would not announce replication to master when the slave is down
> -----------------------------------------------------------------------------
>
>                 Key: ARTEMIS-1285
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-1285
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.1.0
>            Reporter: yangwei
>            Priority: Major
>
> We have a cluster of 3 instances: A is the master, B is the slave and C is the standby slave.
> When the slave is down, we expect C to announce replication to A, but A stays in standalone
> mode the whole time. Through the jstack command we see C waiting at "nodeLocator.locateNode()".



