activemq-dev mailing list archives

From "Torsten Mielke (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AMQ-5568) deleting lock file on broker shut down can take a master broker down
Date Fri, 06 Feb 2015 13:02:34 GMT

     [ https://issues.apache.org/jira/browse/AMQ-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Torsten Mielke updated AMQ-5568:
--------------------------------
    Description: 
This problem may only occur in a shared file system master/slave setup.
I can reproduce it reliably on an NFSv4 mount using a persistence adapter configuration like:

{code}
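<!-- lockKeepAlivePeriod: the current lock holder re-validates the lock file every 5 s;
     lockAcquireSleepInterval: a waiting slave retries the lock acquisition every 10 s -->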
<levelDB directory="/nfs/activemq/data/leveldb" lockKeepAlivePeriod="5000">
  <locker>
    <shared-file-locker lockAcquireSleepInterval="10000"/>
  </locker>
</levelDB>
{code}

However, the problem is also reproducible using KahaDB (a comparable configuration is sketched below).
Two broker instances compete for the lock on the shared storage (e.g. LevelDB or KahaDB).
Let's say broker A becomes the master and broker B the slave.
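
For illustration only, a roughly equivalent KahaDB persistence adapter configuration might look
like the following (the directory path and interval values are assumptions, not taken from the
affected setup):

{code}
<kahaDB directory="/nfs/activemq/data/kahadb" lockKeepAlivePeriod="5000">
  <locker>
    <shared-file-locker lockAcquireSleepInterval="10000"/>
  </locker>
</kahaDB>
{code}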

If broker A loses access to the NFS share, it will shut down. As part of shutting down, it
tries to delete the lock file of the persistence adapter. Since the NFS share is gone, all
file I/O calls hang for a good while before returning errors, so the broker shutdown is
delayed.

In the meantime, slave broker B (not affected by the NFS problem) grabs the lock and becomes
the new master.

If the NFS mount is restored while broker A (the previous master) is still blocked on those file
I/O operations (as part of its shutdown routine), the attempt to delete the persistence adapter
lock file finally succeeds and broker A shuts down.

Deleting the lock file, however, also affects the new master, broker B, which periodically runs
a keepAlive() check on the lock. That check verifies that the lock file still exists and that the
FileLock is still valid. Because the lock file has been deleted, keepAlive() fails on broker B and
that broker shuts down as well.
The overall result is that both broker instances have shut down despite an initially successful
failover.

Using restartAllowed=true is not an option either, as it can cause other problems in an NFS-based
master/slave setup.
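
For context, restartAllowed is (to the best of my knowledge) exposed as an attribute on the broker
element, roughly along these lines; the broker name is a placeholder and the rest of the broker
configuration is omitted:

{code}
<broker xmlns="http://activemq.apache.org/schema/core" brokerName="brokerA" restartAllowed="true">
  <!-- persistence adapter, transport connectors, etc. omitted -->
</broker>
{code}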


  was:
This problem may only occur in a shared file system master/slave setup.
I can reproduce it reliably on an NFSv4 mount using a persistence adapter configuration like:

{code}
<levelDB directory="/nfs/activemq/data/leveldb" lockKeepAlivePeriod="5000">
  <locker>
    <shared-file-locker lockAcquireSleepInterval="10000"/>
  </locker>
</levelDB>
{code}

However, the problem is also reproducible using KahaDB.
Two broker instances compete for the lock on the shared storage (e.g. LevelDB or KahaDB).
Let's say broker A becomes the master and broker B the slave.

If broker A loses access to the NFS share, it will shut down. As part of shutting down, it
tries to delete the lock file of the persistence adapter. Since the NFS share is gone, all
file I/O calls hang for a good while before returning errors.

In the meantime, slave broker B (not affected by the NFS problem) grabs the lock and becomes
the new master.

If the NFS mount is restored while broker A (the previous master) is still blocked on those file
I/O operations (as part of its shutdown routine), the attempt to delete the lock file
finally succeeds and broker A shuts down.

Deleting the lock file, however, also affects the new master, broker B, which periodically runs
a keepAlive() check on the lock. That check verifies that the lock file still exists and that the
FileLock is still valid. Because the lock has been deleted, keepAlive() fails on broker B and that
broker shuts down as well.
The overall result is that both broker instances have shut down.

Using restartAllowed=true is not an option either, as it can cause other problems in an NFS-based
master/slave setup.



> deleting lock file on broker shut down can take a master broker down
> --------------------------------------------------------------------
>
>                 Key: AMQ-5568
>                 URL: https://issues.apache.org/jira/browse/AMQ-5568
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Broker, Message Store
>    Affects Versions: 5.11.0
>            Reporter: Torsten Mielke
>              Labels: persistence
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
