Date: Fri, 30 Jan 2015 13:08:34 +0000 (UTC)
From: "Torsten Mielke (JIRA)"
To: dev@activemq.apache.org
Subject: [jira] [Commented] (AMQ-5549) Shared Filesystem Master/Slave using NFSv4 allows both brokers become active at the same time

    [ https://issues.apache.org/jira/browse/AMQ-5549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298603#comment-14298603 ]

Torsten Mielke commented on AMQ-5549:
-------------------------------------

I am not suggesting that ActiveMQ requires a soft NFS mount. I just noticed that NFS errors were propagated to the broker much more quickly when using soft mounts.

Yes, the fix for ENTMQ-391 will be needed, and it is contained in 5.9.0. See AMQ-4705.

In my tests I shut down the NIC of the NFS client machine that runs the broker and measured how quickly this resulted in an error on the master broker and how quickly a slave broker running on a different machine took over. With the NFS options
{code}
timeo=50,retrans=1,soft,noac
{code}
and the previously suggested broker configuration, the master broker would raise an exception within 15 seconds of losing access to the NFS share and would shut down within another 1-2 minutes. During the shutdown the broker tries to close all files pointing to the persistence store; those close() calls hang too and need to time out as well. It took about 60-80 seconds for the slave broker to take over.

When testing with the default NFS mount options, the master broker would sometimes not shut down within 10+ minutes. I took various thread dumps along the way and the broker was always hung in a Java I/O operation that took a long time to finally raise an exception. I was able to reproduce the same behavior using a very simple Java application that only performs the same Java I/O.

So IMHO the entire issue really comes down to configuring NFS in a way that quickly raises errors to the application stack.
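
For reference, a minimal sketch of the kind of stand-alone probe mentioned above: it repeatedly writes to and syncs a file on the NFS mount and reports how long each round trip takes, so a hanging mount and a quickly failing mount are easy to compare. The path, interval and class name are illustrative assumptions, not the original test code.
{code:java}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Repeatedly write and sync a file on the NFS mount and report how long
// each operation takes. The default path below is an assumption.
public class NfsIoProbe {
    public static void main(String[] args) throws InterruptedException {
        File file = new File(args.length > 0 ? args[0] : "/mnt/nfs/kahadb/io-probe.dat");
        while (true) {
            long start = System.currentTimeMillis();
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                raf.seek(raf.length());
                raf.writeLong(start);
                raf.getFD().sync(); // force the write through to the NFS server
                System.out.println("write+sync ok in " + (System.currentTimeMillis() - start) + " ms");
            } catch (IOException e) {
                // With soft mounts and small timeo/retrans values this fires within seconds;
                // with default hard mounts the calls above can block almost indefinitely instead.
                System.out.println("I/O failed after " + (System.currentTimeMillis() - start) + " ms: " + e);
            }
            Thread.sleep(1000);
        }
    }
}
{code}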
> Shared Filesystem Master/Slave using NFSv4 allows both brokers become active at the same time
> ---------------------------------------------------------------------------------------------
>
>                 Key: AMQ-5549
>                 URL: https://issues.apache.org/jira/browse/AMQ-5549
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Broker, Message Store
>    Affects Versions: 5.10.1
>         Environment: - CentOS Linux 6
>                      - OpenJDK 1.7
>                      - ActiveMQ 5.10.1
>            Reporter: Heikki Manninen
>            Priority: Critical
>
> Identical ActiveMQ master and slave brokers are installed on CentOS Linux 6 virtual machines. There is a third virtual machine (also CentOS 6) providing an NFSv4 share for the brokers' KahaDB.
> Both brokers are started; the master broker acquires the file lock on the lock file and the slave broker sits in a loop and waits for the lock, as expected. Switching brokers also works as expected.
> Once the network connection of the NFS server is disconnected, both master and slave NFS mounts block and the slave broker stops logging file lock retries. A short while after the network connection is brought back, the mounts recover and the slave broker is able to acquire the lock even though the master still holds it. Both brokers accept client connections.
> In this situation it is also possible to stop and start both individual brokers many times and they are always able to acquire the lock even if the other one is already running. Only after stopping both brokers and starting them again is the situation back to normal.
> * NFS server:
> ** CentOS Linux 6
> ** NFS v4 export options: rw,sync
> ** NFS v4 grace time: 45 seconds
> ** NFS v4 lease time: 10 seconds
> * NFS client:
> ** CentOS Linux 6
> ** NFS mount options: nfsvers=4,proto=tcp,hard,wsize=65536,rsize=65536
> * ActiveMQ configuration (otherwise default):
> {code:xml}
>
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
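
For background, the lock acquisition described in the report boils down to java.nio file locking on the lock file in the shared KahaDB directory. Below is a minimal sketch of that pattern, assuming a lock file path chosen for illustration; it is not ActiveMQ's actual locker implementation. In the reported scenario, both broker machines apparently end up in the "lock acquired" branch after the NFS server comes back.
{code:java}
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

// Try to take an exclusive OS-level lock on the shared lock file, the same
// primitive the shared file system master/slave election relies on.
// The default path below is an assumption for illustration.
public class SharedLockProbe {
    public static void main(String[] args) throws Exception {
        File lockFile = new File(args.length > 0 ? args[0] : "/mnt/nfs/kahadb/lock");
        try (RandomAccessFile raf = new RandomAccessFile(lockFile, "rw")) {
            FileLock lock = raf.getChannel().tryLock();
            if (lock == null) {
                // Held by another process (the other broker): this instance should wait as slave.
                System.out.println("lock held elsewhere - would keep retrying as slave");
            } else {
                // In the scenario reported here, both machines end up in this branch
                // after the NFS outage, i.e. the lock is effectively granted twice.
                System.out.println("lock acquired - would start as master");
                Thread.sleep(Long.MAX_VALUE);
            }
        }
    }
}
{code}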