activemq-users mailing list archives

From "Johannes F. Knauf" <johannes.kn...@ancud.de>
Subject HA in Master/Slave with shared mKahaDb not really HA because of slow failover?
Date Tue, 17 Jan 2017 07:51:24 GMT
Hi,

I filed a bug in JIRA about HA in Master/Slave mode with shared mKahaDB not being really HA
because of extremely slow failover. Depending on the message load, the startup time of the Slave
when it becomes Master can be slowed down seriously (on the order of minutes), which yields an
extremely slow failover and hence a phase of unavailability of the broker.

https://issues.apache.org/jira/browse/AMQ-6564

Timothy Bish suggested discussing this issue here on the users mailing list first, so I gladly
repost it.

---

Consider the following scenario:
* AMQ Host A and Host B are configured exactly the same.
* Host A and Host B share a common filesystem storage for their (m)KahaDB in order to provide HA,
as described in http://activemq.apache.org/shared-file-system-master-slave.html (a configuration
sketch follows after this list).
* It is a high-traffic scenario, where at any point in time a considerable number of messages is
still sitting in each queue.
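
For clarity, this is roughly the kind of setup I mean, as a minimal sketch only (the directory
path, the perDestination split and the journalMaxFileLength value are just examples, not our
exact configuration):

    <persistenceAdapter>
      <mKahaDB directory="/shared/nfs/activemq/mkahadb">
        <filteredPersistenceAdapters>
          <!-- one KahaDB instance per destination -->
          <filteredKahaDB perDestination="true">
            <persistenceAdapter>
              <kahaDB journalMaxFileLength="32mb"/>
            </persistenceAdapter>
          </filteredKahaDB>
        </filteredPersistenceAdapters>
      </mKahaDB>
    </persistenceAdapter>

Both brokers point their mKahaDB directory at the same shared filesystem, which is what makes
the lock-based master/slave election work.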

Expected:
Given Host A is the current master and Host B is polling for the lock every 10 seconds (default),
when Host A goes down,
then Host B should be able to serve producer enqueue requests after at most 10 seconds plus some
microseconds.
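
If I read the pluggable storage locker documentation correctly, the 10 seconds above corresponds
to the default lockAcquireSleepInterval of the shared file locker, which could also be set
explicitly per kahaDB instance, roughly like this (value in milliseconds; I assume the equivalent
applies when the locker is configured on mKahaDB):

    <kahaDB directory="/shared/nfs/activemq/mkahadb">
      <locker>
        <!-- the slave retries the file lock every 10 seconds -->
        <shared-file-locker lockAcquireSleepInterval="10000"/>
      </locker>
    </kahaDB>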

Reality:
Host B needs to replay the whole journals before it is available to accept new messages again.
This can take a long time, especially if consistency checks are required. This means Master/Slave
with a shared FS is not really providing HA.
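
As far as I can tell, the duration of that replay is influenced mainly by a few per-kahaDB
attributes; this is my reading of the KahaDB documentation and the values below are only
illustrative:

    <kahaDB directory="/shared/nfs/activemq/mkahadb"
            journalMaxFileLength="32mb"
            checksumJournalFiles="true"
            checkForCorruptJournalFiles="true"/>

With checkForCorruptJournalFiles enabled, the journal files are verified on start-up, which, as
far as I understand, is exactly the kind of consistency check that takes minutes in our case.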

It is perfectly understandable that for consumers the failover takes that long. They can only
continue receiving messages once all journals have been read; otherwise the order of messages
would be destroyed.

For producers this is not the case, as AMQ could just create a fresh journal file and start
appending immediately. Am I wrong?

Also it seems that each kahaDB in an mKahaDB is checked in sequence, so that in the worst case
even less-filled queues are not available before everything is checked completely.

Long unavailability for producers is unacceptable in most scenarios. It means that all producing
clients have to invest a serious amount of effort to protect against these scenarios in order not
to lose messages (buffering, etc.). Or is there a best-practice workaround?
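
The only client-side mitigation I am aware of is the failover transport with a send timeout, so
that producers at least fail fast instead of blocking for minutes, roughly (hostnames and the
timeout value are just placeholders):

    failover:(tcp://hostA:61616,tcp://hostB:61616)?timeout=3000

But that only turns a hang into an error; the messages still have to be buffered or retried by
the application.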


---

Any ideas why it is like that?

Thanks,
Johannes
