activemq-users mailing list archives

From Miroslav Novak <>
Subject Re: Unreliable NFS exclusive locks on unreliable networks
Date Tue, 27 Mar 2018 11:38:20 GMT
Hi Korny,

I only have experience with this on ActiveMQ Artemis. What I've learned is that this can
be quite sensitive to NFSv4 mount options. In the end we figured out that we needed the
following mount options:

I believe those will be good mount options for you as well. At least they will guarantee that
nothing stays cached in the NFSv4 client on the original master. Might be worth a try.


----- Original Message -----
> From: "dailyxe" <>
> To:
> Sent: Monday, March 26, 2018 9:31:38 PM
> Subject: Unreliable NFS exclusive locks on unreliable networks
> Hi guys, just wondering if anyone else has tested this and found similar
> problems.
> I've been testing ActiveMQ in a shared storage master/slave configuration,
> using an NFSv4 server for the shared storage.  I've tried this both with a
> standalone nfs server, and using Amazon's EFS server.
> My tests are looking into what happens when the network is unreliable -
> specifically, if for some reason the master ActiveMQ broker can't
> communicate with the NFS server.
> What I've been seeing, in a nutshell, is the following:
> - At startup, the Master gets exclusive access to the NFS lock file, and
> the Slave doesn't, and it loops waiting for the lock, as expected.
> - When I cut the Master off from the NFS server, the NFS server eventually
> times out the lock, and the Slave acquires it and starts up.  It gets a
> pile of journal errors, but it does eventually sort things out and start,
> and clients using the failover: protocol start sending messages to the
> slave.
> - Eventually, the Master notices that it is broken and tries to shut down.
> It takes a long time - I get a lot of warnings like:
> [KeepAlive Timer] INFO  TransportConnection            - The connection to
> 'tcp://' is taking a long time to shutdown.
> ... I'm guessing it's trying to gracefully shut down a listener or
> something?  Anyway, eventually I get a DB failure and it dies.
> The problem though, is that the Master re-starts itself - as it should.
> And in the meantime I've repaired the connection to the NFS server.  So the
> master should now try to grab the exclusive lock and fail, and become a
> slave instead.
> However, this generally doesn't seem to happen.  The master restarts, with
> no lock errors, and I have two brokers both thinking they own the same
> NFS-based database.  Not a good situation.  (Once, I had a situation where
> the master did seem to block waiting for a lock, but I haven't been able to
> reproduce that behaviour)
> Has anyone else seen this?  None of this would affect a situation where the
> master broker crashed or was restarted - that should be fine - but it seems
> quite unreliable when a network split occurs, at least from our testing so
> far.
> Note that this may be related to a problem with Java and exclusive file
> locks, which I raised the other day on Stack Overflow:
> the TL;DR is that the FileLock.isValid() check that is used in
> org.apache.activemq.util.LockFile.keepAlive() is pointless - it doesn't
> actually check that the lock is still valid, just that no other thread in
> the same JVM has killed the lock.
> However the LockFile.keepAlive code:
>     public boolean keepAlive() {
>         return lock != null && lock.isValid() && file.exists();
>     }
> ... should still fail, as file.exists() should fail if the NFS server has
> gone away.  (though it's possible this will block rather than failing...)
> - Korny
> --
> Kornelis Sietsma  korny at my surname dot com
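
The concern quoted above, that file.exists() might block rather than fail when the NFS
server is unreachable, can be illustrated with a keep-alive check that treats a stalled
probe as a failure. This is only a minimal sketch and not ActiveMQ code: the class name
TimedLockCheck, the timeoutMillis parameter and the probe thread pool are assumptions made
for illustration. It also does not address the FileLock.isValid() limitation, so a true
result still does not prove that the NFS server considers the lock held.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration only; not part of ActiveMQ. */
public class TimedLockCheck {
    // Daemon threads, so a probe stuck on a hung mount cannot keep the JVM alive.
    private static final ExecutorService PROBE = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "lock-probe");
        t.setDaemon(true);
        return t;
    });

    /** Same three checks as the quoted keepAlive(), but a stalled exists() counts as failure. */
    public static boolean keepAlive(FileLock lock, File file, long timeoutMillis) {
        if (lock == null || !lock.isValid()) {
            return false; // isValid() only reflects JVM-local state
        }
        Callable<Boolean> probe = file::exists;
        Future<Boolean> exists = PROBE.submit(probe);
        try {
            return exists.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            exists.cancel(true); // best effort; blocked file I/O may not be interruptible
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("broker", ".lock");
        file.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
             FileChannel channel = raf.getChannel()) {
            FileLock lock = channel.tryLock();
            System.out.println("held: " + keepAlive(lock, file, 2000));
            lock.release();
            System.out.println("released: " + keepAlive(lock, file, 2000));
        }
    }
}
```

A timed-out probe is reported as "lock lost" on the reasoning that a broker which cannot
reach its shared storage should not keep acting as master. Whether that is the right
trade-off depends on how long the NFS server takes to expire the lock on its side.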
