Date: Mon, 26 Mar 2018 12:31:38 -0700 (MST)
From: dailyxe
To: users@activemq.apache.org
Subject: Unreliable NFS exclusive locks on unreliable networks

Hi guys, just wondering if anyone else has tested this and found similar problems.

I've been testing ActiveMQ in a shared-storage master/slave configuration, using an NFSv4 server for the shared storage. I've tried this both with a standalone NFS server and with Amazon's EFS service. My tests look at what happens when the network is unreliable - specifically, what happens if for some reason the master ActiveMQ broker can't communicate with the NFS server.

What I've been seeing, in a nutshell, is the following:

- At startup, the master gets exclusive access to the NFS lock file and the slave doesn't, so the slave loops waiting for the lock, as expected.

- When I cut the master off from the NFS server, the NFS server eventually times out the lock, and the slave acquires it and starts up. It gets a pile of journal errors, but it does eventually sort things out and start, and clients using the failover: protocol (roughly the setup sketched below) start sending messages to the slave.

- Eventually, the master notices that it is broken and tries to shut down. It takes a long time - I get a lot of log messages like:

    [KeepAlive Timer] INFO TransportConnection - The connection to 'tcp://10.0.12.209:42150' is taking a long time to shutdown.
    ...

  I'm guessing it's trying to gracefully shut down a listener or something? Anyway, eventually I get a DB failure and it dies.

The problem, though, is that the master then restarts itself - as it should. In the meantime I've repaired the connection to the NFS server, so the master should now try to grab the exclusive lock, fail, and become a slave instead. However, this generally doesn't seem to happen. The master restarts with no lock errors, and I end up with two brokers both thinking they own the same NFS-based database. Not a good situation.

(Once, I did see the master block waiting for the lock, but I haven't been able to reproduce that behaviour.)

Has anyone else seen this?
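For reference, the clients in my tests do roughly the following - this is a rough sketch rather than my exact test code, and the host names and queue name are just placeholders for my environment:

    import javax.jms.Connection;
    import javax.jms.DeliveryMode;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;

    import org.apache.activemq.ActiveMQConnectionFactory;

    public class FailoverTestClient {
        public static void main(String[] args) throws Exception {
            // The failover: transport keeps retrying and reconnects to
            // whichever broker is currently up - i.e. whichever one holds
            // (or thinks it holds) the shared-storage lock.
            // master-host / slave-host / TEST.QUEUE are placeholders.
            ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(
                    "failover:(tcp://master-host:61616,tcp://slave-host:61616)");
            Connection connection = factory.createConnection();
            connection.start();

            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("TEST.QUEUE");
            MessageProducer producer = session.createProducer(queue);
            producer.setDeliveryMode(DeliveryMode.PERSISTENT);
            producer.send(session.createTextMessage("test message"));

            connection.close();
        }
    }

Nothing special on the client side - the point is just that the clients happily fail over to the slave while the old master is still running.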
None of this would affect a situation where the master broker crashed or was restarted - that should be fine - but it seems quite unreliable when a network split occurs, at least from our testing so far.

Note that this may be related to a problem with Java and exclusive file locks, which I raised the other day on Stack Overflow: http://stackoverflow.com/questions/38397559/is-there-any-way-to-tell-if-a-java-exclusive-filelock-on-an-nfs-share-is-really

The TL;DR is that the FileLock.isValid() check used in org.apache.activemq.util.LockFile.keepAlive() is pointless - it doesn't actually check that the lock is still valid on the NFS server, just that nothing in the same JVM has released the lock or closed its channel. However, the LockFile.keepAlive code:

    public boolean keepAlive() {
        return lock != null
                && lock.isValid()
                && file.exists();
    }

... should still fail, as file.exists() should fail if the NFS server has gone away. (Though it's possible this will block rather than failing...)

- Korny

--
Kornelis Sietsma
korny at my surname dot com
http://korny.info
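P.S. If anyone wants to poke at the isValid()/exists() behaviour directly, a minimal standalone probe along these lines (a rough sketch - the NFS path is just an example) shows the difference while you break and restore the mount:

    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileLock;

    public class NfsLockProbe {
        public static void main(String[] args) throws Exception {
            // Path is just an example - point it at a file on the NFS mount.
            File file = new File("/mnt/nfs-test/lock-probe");
            RandomAccessFile raf = new RandomAccessFile(file, "rw");
            FileLock lock = raf.getChannel().tryLock();
            if (lock == null) {
                System.out.println("lock is already held by another process");
                return;
            }
            while (true) {
                // isValid() only reflects local JVM state (released/closed),
                // whereas exists() has to talk to the NFS server, so it's the
                // call that should notice - or hang - when the server is gone.
                System.out.println("isValid=" + lock.isValid()
                        + " exists=" + file.exists());
                Thread.sleep(5000);
            }
        }
    }

I'd expect isValid() to keep returning true after the NFS server disappears, while exists() either returns false, throws, or blocks, presumably depending on the mount options.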