Date: Mon, 26 Mar 2018 12:31:38 -0700 (MST)
From: dailyxe
To: users@activemq.apache.org
Subject: Unreliable NFS exclusive locks on unreliable networks

Hi guys, just wondering if anyone else has tested this and found similar problems.

I've been testing ActiveMQ in a shared-storage master/slave configuration, using an NFSv4 server for the shared storage. I've tried this both with a standalone NFS server and with Amazon's EFS service. My tests look at what happens when the network is unreliable - specifically, what happens if for some reason the master ActiveMQ broker can't communicate with the NFS server.

What I've been seeing, in a nutshell, is the following:

- At startup, the master gets exclusive access to the NFS lock file and the slave doesn't, so the slave loops waiting for the lock, as expected.

- When I cut the master off from the NFS server, the NFS server eventually times out the lock, and the slave acquires it and starts up. It gets a pile of journal errors, but it does eventually sort things out and start, and clients using the failover: protocol (roughly the setup sketched below) start sending messages to the slave.

- Eventually, the master notices that it is broken and tries to shut down. It takes a long time - I get a lot of log messages like:

    [KeepAlive Timer] INFO TransportConnection - The connection to 'tcp://10.0.12.209:42150' is taking a long time to shutdown.
    ...

  I'm guessing it's trying to gracefully shut down a listener or something? Anyway, eventually I get a DB failure and it dies.

The problem, though, is that the master then restarts itself - as it should. In the meantime I've repaired the connection to the NFS server, so the master should now try to grab the exclusive lock, fail, and become a slave instead. However, this generally doesn't seem to happen. The master restarts with no lock errors, and I end up with two brokers both thinking they own the same NFS-based database. Not a good situation.

(Once, I did see the master block waiting for the lock, but I haven't been able to reproduce that behaviour.)

Has anyone else seen this?
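For reference, the clients in my tests do roughly the following - this is a rough sketch rather than my exact test code, and the host names and queue name are just placeholders for my environment:

    import javax.jms.Connection;
    import javax.jms.DeliveryMode;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;

    import org.apache.activemq.ActiveMQConnectionFactory;

    public class FailoverTestClient {
        public static void main(String[] args) throws Exception {
            // The failover: transport keeps retrying and reconnects to
            // whichever broker is currently up - i.e. whichever one holds
            // (or thinks it holds) the shared-storage lock.
            // master-host / slave-host / TEST.QUEUE are placeholders.
            ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(
                    "failover:(tcp://master-host:61616,tcp://slave-host:61616)");
            Connection connection = factory.createConnection();
            connection.start();

            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("TEST.QUEUE");
            MessageProducer producer = session.createProducer(queue);
            producer.setDeliveryMode(DeliveryMode.PERSISTENT);
            producer.send(session.createTextMessage("test message"));

            connection.close();
        }
    }

Nothing special on the client side - the point is just that the clients happily fail over to the slave while the old master is still running.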
None of this would affect a situation where the master broker crashed or was restarted - that should be fine - but it seems quite unreliable when a network split occurs, at least from our testing so far.

Note that this may be related to a problem with Java and exclusive file locks, which I raised the other day on Stack Overflow: http://stackoverflow.com/questions/38397559/is-there-any-way-to-tell-if-a-java-exclusive-filelock-on-an-nfs-share-is-really

The TL;DR is that the FileLock.isValid() check used in org.apache.activemq.util.LockFile.keepAlive() is pointless - it doesn't actually check that the lock is still valid on the NFS server, just that nothing in the same JVM has released the lock or closed its channel. However, the LockFile.keepAlive code:

    public boolean keepAlive() {
        return lock != null
                && lock.isValid()
                && file.exists();
    }

... should still fail, as file.exists() should fail if the NFS server has gone away. (Though it's possible this will block rather than failing...)

- Korny

--
Kornelis Sietsma
korny at my surname dot com
http://korny.info
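P.S. If anyone wants to poke at the isValid()/exists() behaviour directly, a minimal standalone probe along these lines (a rough sketch - the NFS path is just an example) shows the difference while you break and restore the mount:

    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileLock;

    public class NfsLockProbe {
        public static void main(String[] args) throws Exception {
            // Path is just an example - point it at a file on the NFS mount.
            File file = new File("/mnt/nfs-test/lock-probe");
            RandomAccessFile raf = new RandomAccessFile(file, "rw");
            FileLock lock = raf.getChannel().tryLock();
            if (lock == null) {
                System.out.println("lock is already held by another process");
                return;
            }
            while (true) {
                // isValid() only reflects local JVM state (released/closed),
                // whereas exists() has to talk to the NFS server, so it's the
                // call that should notice - or hang - when the server is gone.
                System.out.println("isValid=" + lock.isValid()
                        + " exists=" + file.exists());
                Thread.sleep(5000);
            }
        }
    }

I'd expect isValid() to keep returning true after the NFS server disappears, while exists() either returns false, throws, or blocks, presumably depending on the mount options.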