activemq-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Christian Schneider <ch...@die-schneider.net>
Subject Re: Failover very slow with kahadb while restart of master is fast
Date Tue, 15 Mar 2016 12:54:29 GMT
I now took thread dumps of activemq while it was taking so long to 
start  up.

This is where it seemed to spend most of the time:
http://apaste.info/H1c

So it looks like it is reading the kahadb journals. Does anyone have an 
idea why it seems to matter if the master restarts (fast) or the dlave 
takes over (slow).
After all the files are in the same remote location in both cases.

Christian

# at 
org.apache.activemq.util.RecoverableRandomAccessFile.readFully(RecoverableRandomAccessFile.java:75)
# at 
org.apache.activemq.store.kahadb.disk.journal.DataFileAccessor.readFully(DataFileAccessor.java:98)
# at 
org.apache.activemq.store.kahadb.disk.journal.Journal.checkBatchRecord(Journal.java:302)
# at 
org.apache.activemq.store.kahadb.disk.journal.Journal.recoveryCheck(Journal.java:210)
# at 
org.apache.activemq.store.kahadb.disk.journal.Journal.start(Journal.java:170)
# -locked <0x000000070086c700>(a 
org.apache.activemq.store.kahadb.disk.journal.Journal)
# at 
org.apache.activemq.store.kahadb.MessageDatabase.open(MessageDatabase.java:378)



On 15.02.2016 15:15, Tim Bain wrote:
> You could use a profiler such as JVisualVM to tell you what work is being
> done.  You could get some of that information from thread dumps via jstack,
> but I think a profiler would give much more useful and detailed information
> for less effort.
>
> Tim
> On Feb 15, 2016 5:37 AM, "Jean-Baptiste Onofré" <jb@nanthrax.net> wrote:
>
>> Hi Christian,
>>
>> what's the option on the NFS ?
>>
>> Most of the time, I use I use sync and I increase the rsize and wsize.
>>
>> Anyway, the difference is huge. The problem could be related to the
>> journal checksum check. Can you try checkForCorruptJournalFiles="false" on
>> the kahadb configuration ?
>>
>> Maybe increasing the kahadb jobs could help too:
>>
>>              <kahaDB directory="${karaf.data}/activemq/activemq/kahadb"
>>                  indexWriteBatchSize="1000"
>>                  indexCacheSize="2000"
>>                  journalMaxFileLength="32mb"
>>                  checkForCorruptJournalFiles="false"
>>                  maxAsyncJobs="5000"
>>                  concurrentStoreAndDispatchQueues="true"
>>                  concurrentStoreAndDispatchTopics="true"
>>                  enableJournalDiskSyncs="true"
>>                  enableIndexWriteAsync="true"/>
>>
>>
>> Regards
>> JB
>>
>> On 02/15/2016 01:32 PM, Christian Schneider wrote:
>>
>>> At a customer we are experiencing a very strange behaviour of activemq.
>>>
>>> We have two brokers on separate machines A and B. Storage is on a third
>>> machine and imported using nfsv4.
>>> Kahadb contains about 18 GB of journal files.
>>>
>>> Scenario 1:
>>> - Stop B
>>> - Stop A (master now)
>>> - Start A
>>>
>>> The start of A takes about 50 seconds.
>>>
>>> Scnario 2:
>>> - Stop A
>>> - Stop B (master now)
>>> - Start A
>>>
>>> The start of A takes about 5 minutes.
>>>
>>> In the slow case the log at debug level shows a big gap between adding
>>> the last queue (12:04) and running the ActiveMQ Journal Checkpoint
>>> Worker again after the queue creation (12:09).
>>> I have no idea what activemq is doing in the mean time but it seems to
>>> take several minutes.
>>>
>>> Can anyone explain this behaviour or explain what activemq does in the
>>> mean time? I would also be happy about pointer to the code.
>>> What I do not understand is why it makes a difference if the master is
>>> started again or if the slave is becoming master. As they both share the
>>> same kahadb over nfs I would expect the same startup
>>> behaviour.
>>>
>>> Christian
>>>
>>> -----
>>>
>>> Slow start
>>> 2016-02-15 12:03:01,943 | INFO  | Refreshing
>>> org.apache.activemq.xbean.XBeanBrokerFactory$1@17b14695: startup date
>>> [Mon Feb 15 12:03:01 CET 2016]; root of context hierarchy |
>>> org.apache.activemq.xbean.XBeanBrokerFactory$1 | main
>>> ...
>>> 2016-02-15 12:03:03,018 | INFO  |
>>> PListStore:[/path/activemq/data/A/tmp_storage] started |
>>> org.apache.activemq.store.kahadb.plist.PListStoreImpl | main
>>> 2016-02-15 12:03:03,033 | INFO  | ignoring zero length, partially
>>> initialised journal data file: db-1.log number = 1 , length = 0 |
>>> org.apache.activemq.store.kahadb.disk.journal.Journal | main
>>> 2016-02-15 12:03:03,177 | INFO  | JobSchedulerStore:path/scheduler
>>> started |
>>> org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl | main
>>> 2016-02-15 12:03:03,180 | INFO  | JobScheduler using directory:
>>> path/scheduler | org.apache.activemq.broker.BrokerService | main
>>> 2016-02-15 12:03:03,335 | INFO  | Using Persistence Adapter:
>>> KahaDBPersistenceAdapter[/path] |
>>> org.apache.activemq.broker.BrokerService | main
>>> 2016-02-15 12:03:03,350 | INFO  | JMX consoles can connect to
>>> service:jmx:rmi:///jndi/rmi://localhost:1617/jmxrmi |
>>> org.apache.activemq.broker.jmx.ManagementContext | JMX connector
>>> 2016-02-15 12:03:05,198 | INFO  | Corrupt journal records found in
>>> 'path/db-1122.log' between offsets: 17893102..17893732 |
>>> org.apache.activemq.store.kahadb.disk.journal.Journal | main
>>> 2016-02-15 12:03:47,925 | INFO  | KahaDB is version 5 |
>>> org.apache.activemq.store.kahadb.MessageDatabase | main
>>> ....
>>> 2016-02-15 12:03:49,233 | INFO  | Recovering from the journal ... |
>>> org.apache.activemq.store.kahadb.MessageDatabase | main
>>> 2016-02-15 12:03:49,237 | INFO  | Recovery replayed 20 operations from
>>> the journal in 0.094 seconds. |
>>> org.apache.activemq.store.kahadb.MessageDatabase | main
>>> 2016-02-15 12:04:26,863 | DEBUG | Checkpoint started. |
>>> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
>>> Checkpoint Worker
>>> 2016-02-15 12:04:26,932 | DEBUG | Checkpoint done. |
>>> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
>>> Checkpoint Worker
>>> 2016-02-15 12:04:26,940 | INFO  | Apache ActiveMQ 5.10.2 (A,
>>> ID:A-59167-1455534183070-1:1) is starting |
>>> org.apache.activemq.broker.BrokerService | main
>>> ...
>>> 2016-02-15 12:04:30,289 | DEBUG | A adding destination: queue://q1.DLQ |
>>> org.apache.activemq.broker.region.AbstractRegion | main
>>> 2016-02-15 12:04:57,026 | DEBUG | queue://q1.DLQ expiring messages .. |
>>> org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler
>>> 2016-02-15 12:04:57,029 | DEBUG | q1.DLQ toPageIn: 5, Inflight: 0,
>>> pagedInMessages.size 0, pagedInPendingDispatch.size 0, enqueueCount: 0,
>>> dequeueCount: 0, memUsage:0 | org.apache.activemq.broker.region.Queue |
>>> ActiveMQ Broker[A] Scheduler
>>> 2016-02-15 12:09:20,440 | DEBUG | Checkpoint started. |
>>> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
>>> Checkpoint Worker
>>> 2016-02-15 12:09:20,476 | DEBUG | Checkpoint done. |
>>> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
>>> Checkpoint Worker
>>> 2016-02-15 12:09:20,488 | DEBUG | queue://q1.DLQ expiring messages done.
>>> | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler
>>>
>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>


-- 
Christian Schneider
http://www.liquid-reality.de

Open Source Architect
http://www.talend.com


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message