activemq-issues mailing list archives

From "Volker Kleinschmidt (JIRA)" <>
Subject [jira] [Commented] (AMQ-6005) Slave broker startup corrupts shared PList storage
Date Wed, 21 Oct 2015 19:21:27 GMT


Volker Kleinschmidt commented on AMQ-6005:

Also, to replicate the source of the problem, you can simply hand-create empty,
tmpDB.redo, and db-1.log files in the shared tmp_storage folder, then restart one of the slave
nodes while the master broker is still running. You will see that the slave startup deletes
those files, which would corrupt the master broker's tmpDB if the master were using it at the
time. So replicating the issue does not require generating enough asynchronous message load to
make the broker use tmp_storage: the problem is the tmpDB deletion by a *slave*.
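
The replication steps above could be scripted roughly as follows. This is only a sketch: the class name `TmpStorageProbe` and its methods are hypothetical helpers, not part of ActiveMQ, and the path to the shared tmp_storage folder is assumed to be passed as the first argument. It creates the marker files only if they do not already exist, so it never truncates files a live broker is using.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ActiveMQ: drops empty marker files into
// tmp_storage and later reports which of them have disappeared.
public class TmpStorageProbe {

    // Creates each named file as an empty file unless it already exists,
    // and returns the full paths of all named files.
    static List<Path> createMarkers(Path dir, String... names) throws IOException {
        Files.createDirectories(dir);
        List<Path> markers = new ArrayList<>();
        for (String name : names) {
            Path p = dir.resolve(name);
            if (Files.notExists(p)) {
                Files.createFile(p);
            }
            markers.add(p);
        }
        return markers;
    }

    // Returns the marker files that no longer exist.
    static List<Path> missing(List<Path> markers) {
        List<Path> gone = new ArrayList<>();
        for (Path p : markers) {
            if (Files.notExists(p)) {
                gone.add(p);
            }
        }
        return gone;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: args[0] is the shared tmp_storage folder.
        Path dir = Paths.get(args.length > 0 ? args[0] : "tmp_storage");
        List<Path> markers = createMarkers(dir, "tmpDB.redo", "db-1.log");
        System.out.println("Created " + markers);
        System.out.println("Restart a slave node now, then press Enter.");
        System.in.read();
        System.out.println("Missing after slave startup: " + missing(markers));
    }
}
```

If the list printed at the end is non-empty while the master was running the whole time, the slave startup removed the files.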

> Slave broker startup corrupts shared PList storage
> --------------------------------------------------
>                 Key: AMQ-6005
>                 URL:
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: KahaDB
>    Affects Versions: 5.7.0, 5.10.0
>         Environment: RHLinux6
>            Reporter: Volker Kleinschmidt
> h4. Background
> When multiple JVMs run AMQ in a master/slave configuration with the broker directory
in a shared filesystem location (as is required e.g. for kahaPersistence), and when due to
high message volume or slow producers the broker's memory needs exceed the configured memory
usage limit, AMQ will overflow asynchronous messages to a PList store inside the "tmp_storage"
subdirectory of said shared broker directory.
> h4. Issue
> We frequently observed this tmpDB store getting corrupted with "stale NFS filehandle"
errors for, tmpDB.redo, and some journal files, all of which suddenly went missing
from the tmp_storage folder. This puts the entire broker into a bad state from which it cannot
recover. Only restarting the service (which causes a broker slave to take over and loses the
yet-undelivered messages) gets a working state back.
> h4. Symptoms
> Stack trace:
> {noformat}
> ...
> Caused by: Stale file handle
> 	at Method)
> 	at
> 	at
> 	at
> 	at
> 	at
> 	at$2.readPage(
> 	at$2.<init>(
> 	at
> 	at
> 	at
> 	at org.apache.kahadb.index.ListIndex.loadNode(
> 	at org.apache.kahadb.index.ListIndex.getHead(
> 	at org.apache.kahadb.index.ListIndex.iterator(
> 	at$PListIterator.<init>(
> 	at
> 	at$DiskIterator.<init>(
> {noformat}
> h4. Cause
> During BrokerThread startup, the BrokerService.startPersistenceAdapter() method is called.
Via doStartPersistenceAdapter() and getProducerSystemUsage() it invokes getSystemUsage(),
which calls getTempDataStore(), and that method summarily cleans out the existing contents
of the tmp_storage directory.
> All of this happens *before* the broker lock is obtained in the PersistenceAdapter.start()
method at the end of doStartPersistenceAdapter().
> So a JVM that doesn't get to be the broker (because there already is one) and runs in
slave mode (waiting to obtain the broker lock) interferes with and corrupts the running broker's
tmp_storage and thus breaks the broker. That's a critical bug. The slave has no business starting
up the persistence adapter and cleaning out data, because it hasn't obtained the lock yet and so
isn't allowed to do any work, period.
> h4. Workaround
> As a workaround, an unshared local directory needs to be specified as tempDirectory for
the broker, even if the main broker directory is shared. Also, since broker startup will clear
the tmp_storage out anyway, there really is no advantage to having this in a shared location
- since the next broker that starts up after a broker failure will never re-use that data
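
The ordering problem described under Cause can be illustrated with a small standalone sketch. This is not ActiveMQ's code and not a proposed patch: `LockThenClean`, `acquireThenClean` and `cleanDirectory` are hypothetical names, and a plain `java.nio` file lock stands in for the broker lock. The point is only the order of operations: try for the lock first, and touch the temp directory only if the lock was obtained. File lock behavior on NFS varies by platform, so treat this as a local illustration.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration, not ActiveMQ code.
public class LockThenClean {

    // Deletes the regular files directly inside dir and returns how many were removed.
    static int cleanDirectory(Path dir) throws IOException {
        int removed = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path p : entries) {
                if (Files.isRegularFile(p)) {
                    Files.delete(p);
                    removed++;
                }
            }
        }
        return removed;
    }

    // Tries to take the lock. Only if that succeeds is tmpDir cleaned.
    // Returns the open channel holding the lock (closing it releases the lock),
    // or null if another holder has the lock, in which case tmpDir is untouched.
    static FileChannel acquireThenClean(Path lockFile, Path tmpDir) throws IOException {
        FileChannel channel = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileLock lock = null;
        try {
            lock = channel.tryLock();
        } catch (OverlappingFileLockException e) {
            // The lock is already held within this JVM; treat as "not master".
        }
        if (lock == null) {
            channel.close();
            return null;
        }
        try {
            cleanDirectory(tmpDir);
        } catch (IOException e) {
            channel.close();
            throw e;
        }
        return channel;
    }
}
```

With this ordering, a process that fails to get the lock returns without deleting anything, which is the behavior the description argues a slave should have.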

