From dailyxe <>
Subject Unreliable NFS exclusive locks on unreliable networks
Date Mon, 26 Mar 2018 19:31:38 GMT
Hi guys, just wondering if anyone else has tested this and found similar
results.

I've been testing ActiveMQ in a shared storage master/slave configuration, 
using an NFSv4 server for the shared storage.  I've tried this both with a 
standalone nfs server, and using Amazon's EFS server. 

My tests are looking into what happens when the network is unreliable - 
specifically, if for some reason the master ActiveMQ broker can't 
communicate with the NFS server. 

What I've been seeing, in a nutshell, is the following: 

- At startup, the Master gets exclusive access to the NFS lock file, and 
the Slave doesn't, and it loops waiting for the lock, as expected. 

- When I cut the Master off from the NFS server, the NFS server eventually 
times out the lock, and the Slave acquires it and starts up.  It gets a 
pile of journal errors, but it does eventually sort things out and start, 
and clients using the failover: protocol start sending messages to the
Slave.

- Eventually, the Master notices that it is broken and tries to shut down. 
It takes a long time - I get a lot of warnings like: 
[KeepAlive Timer] INFO  TransportConnection            - The connection to 
'tcp://' is taking a long time to shutdown. 
... I'm guessing it's trying to gracefully shut down a listener or 
something?  Anyway, eventually I get a DB failure and it dies. 

The problem though, is that the Master re-starts itself - as it should. 
And in the meantime I've repaired the connection to the NFS server.  So the 
master should now try to grab the exclusive lock and fail, and become a 
slave instead. 

However, this generally doesn't seem to happen.  The master restarts, with 
no lock errors, and I have two brokers both thinking they own the same 
NFS-based database.  Not a good situation.  (Once, I had a situation where 
the master did seem to block waiting for a lock, but I haven't been able to 
reproduce that behaviour) 

Has anyone else seen this?  None of this would affect a situation where the 
master broker crashed or was restarted - that should be fine - but it seems 
quite unreliable when a network split occurs, at least from our testing so
far.

Note that this may be related to a problem with Java and exclusive file 
locks, which I raised the other day on Stack Overflow:

The TL;DR is that the FileLock.isValid() check that is used in
org.apache.activemq.util.LockFile.keepAlive() is pointless - it doesn't
actually check that the lock is still held on the server, just that nothing
in the same JVM has released the lock or closed its channel.
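To illustrate what isValid() actually reports, here is a minimal sketch. The class and method names (IsValidDemo, probe) are made up for this example and are not part of ActiveMQ. It only shows the local behaviour documented for java.nio.channels.FileLock: the flag flips when the lock is released (or its channel closed) in this JVM. It does not reproduce the NFS failure itself.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class IsValidDemo {
    // Hypothetical helper: returns { isValid() after acquire, isValid() after release }.
    static boolean[] probe(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = ch.tryLock();
            if (lock == null) {
                throw new IllegalStateException("lock is held by another process");
            }
            // True simply because this JVM has not released the lock yet;
            // no round trip to the file server is involved in this call.
            boolean afterAcquire = lock.isValid();
            lock.release();
            // False only because release() was called locally.
            boolean afterRelease = lock.isValid();
            return new boolean[] { afterAcquire, afterRelease };
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("lockdemo", ".lck");
        boolean[] r = probe(p);
        System.out.println(r[0] + " " + r[1]); // prints "true false"
        Files.delete(p);
    }
}
```

So if the NFS server expires the lock on its side, nothing in this code path would notice.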

However the LockFile.keepAlive code: 
    public boolean keepAlive() {
        return lock != null && lock.isValid() && file.exists();
    }
... should still fail, as file.exists() should fail if the NFS server has 
gone away.  (though it's possible this will block rather than failing...) 
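If file.exists() does block rather than fail, one possible workaround is to run the check on a separate thread and treat a timeout as a failed keep-alive. The following is only a sketch under that assumption. TimedLockCheck, checkWithin and existsWithin are hypothetical names, not ActiveMQ code, and I haven't tried this against a real NFS outage.

```java
import java.io.File;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedLockCheck {
    // Daemon threads, so a probe stuck in a blocked call cannot keep the JVM alive.
    private static final ExecutorService PROBES = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "lock-probe");
        t.setDaemon(true);
        return t;
    });

    // Runs the check on another thread; a timeout or an exception counts as "false".
    static boolean checkWithin(Callable<Boolean> check, long timeoutMillis) {
        Future<Boolean> result = PROBES.submit(check);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: interrupting may not unblock a thread stuck in file I/O.
            result.cancel(true);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false;
        }
    }

    // Same as file.exists(), but gives up after timeoutMillis.
    static boolean existsWithin(File file, long timeoutMillis) {
        Callable<Boolean> check = file::exists;
        return checkWithin(check, timeoutMillis);
    }
}
```

A keepAlive() along these lines could call existsWithin(file, someTimeout) in place of file.exists(). Note that a probe thread stuck on a hard NFS mount may never return, so threads could pile up during a long outage, and this still doesn't prove the lock itself is held on the server.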

- Korny 

Kornelis Sietsma  korny at my surname dot com
