activemq-dev mailing list archives

From "Aaron Riekenberg (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AMQ-2149) Shared Filesystem Master Slave: missing messages
Date Thu, 12 Mar 2009 23:16:39 GMT

    [ https://issues.apache.org/activemq/browse/AMQ-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=50487#action_50487 ]

Aaron Riekenberg commented on AMQ-2149:
---------------------------------------

Dejan -

Based on your comments, I tried a couple of tests.  In these tests I was running FUSE message
broker 5.3.0.0.  I did not set the prefetch size, so it had the default value.  I did comment
out the entire <systemUsage> stanza of the broker's configuration.

1. I ran the script with its default sleep of 20 seconds between steps, as in my original test,
to isolate the effect of removing <systemUsage>.  This failed after 3 broker failovers.
The activemq log for this run is attached as activemq.log.2009_03_12_1.

{{Thu Mar 12 17:59:15 CDT 2009 started master broker pid 28845}}
{{Thu Mar 12 17:59:25 CDT 2009 started slave broker pid 29069}}
{{Thu Mar 12 17:59:35 CDT 2009 killing master broker pid 28845, new master pid 29069}}
{{Thu Mar 12 17:59:55 CDT 2009 started slave broker pid 29285}}
{{Thu Mar 12 18:00:15 CDT 2009 killing master broker pid 29069, new master pid 29285}}
{{Thu Mar 12 18:00:35 CDT 2009 started slave broker pid 29515}}
{{Thu Mar 12 18:00:55 CDT 2009 killing master broker pid 29285, new master pid 29515}}

{{Mar 12, 2009 6:00:56 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect}}
{{INFO: Successfully reconnected to tcp://localhost:61616}}
{{Mar 12, 2009 6:00:56 PM org.aaron.MasterSlaveTest$Receiver onMessage}}
{{WARNING: test.queue.8 received 520 expected 2712}}


2. Then I modified the script so it sleeps 60 seconds between steps instead of 20.  This also
failed after 3 broker failovers.  The activemq log for this run is attached as activemq.log.2009_03_12_2.

{{Thu Mar 12 18:03:34 CDT 2009 started master broker pid 29871}}
{{Thu Mar 12 18:03:44 CDT 2009 started slave broker pid 30090}}
{{Thu Mar 12 18:04:44 CDT 2009 killing master broker pid 29871, new master pid 30090}}
{{Thu Mar 12 18:05:44 CDT 2009 started slave broker pid 30402}}
{{Thu Mar 12 18:06:44 CDT 2009 killing master broker pid 30090, new master pid 30402}}
{{Thu Mar 12 18:07:44 CDT 2009 started slave broker pid 30725}}
{{Thu Mar 12 18:08:44 CDT 2009 killing master broker pid 30402, new master pid 30725}}

{{Mar 12, 2009 6:08:46 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect}}
{{INFO: Successfully reconnected to tcp://localhost:61616}}
{{Mar 12, 2009 6:08:46 PM org.aaron.MasterSlaveTest$Receiver onMessage}}
{{WARNING: test.queue.5 received 1049 expected 3205}}


> Shared Filesystem Master Slave: missing messages
> ------------------------------------------------
>
>                 Key: AMQ-2149
>                 URL: https://issues.apache.org/activemq/browse/AMQ-2149
>             Project: ActiveMQ
>          Issue Type: Bug
>    Affects Versions: 5.2.0
>         Environment: Ubuntu Linux 8.10 AMD64, Sun JDK 1.6.0.10
>            Reporter: Aaron Riekenberg
>         Attachments: activemq.log, activemq.log.2009_03_12_1, activemq.log.2009_03_12_2,
>                      activemq.xml, MasterSlaveTest.java, MasterSlaveTestWithTransactions.java, run_master_slave_brokers.sh
>
>
> I'm finding that messages are occasionally not delivered in order in a shared filesystem master
> slave setup when the master fails and the slave takes over.  I'm running a simple test on
> one physical machine where the shared filesystem is on a single disk (no SAN currently involved).
>
> I'm attaching a shell script (run_master_slave_brokers.sh) that starts a master and slave
> broker in the same directory, sleeps 20 seconds, kills the master, sleeps 20 seconds, starts
> a new slave, sleeps 20 seconds, kills the master, etc.
>
> Also attached is a small Java test program (MasterSlaveTest.java).  The program starts
> 10 JMS senders that send 75 KB text messages every 25 ms to unique queues.  These messages
> contain a sequence number header (a long).  The program also starts 10 receivers (1 for each
> queue) that keep track of the next expected sequence number and validate each incoming sequence
> number.  If a receiver gets an unexpected sequence number, the test program exits (System.exit(1)).
> Both the senders and receivers use the failover transport to connect to the broker.  Messages
> being sent are persistent, so in theory there should be no message loss when the master fails
> and the slave takes over.
>
> I run the script to start the brokers, then run my test program.  Most times when the
> script kills the master and the slave is promoted, things work fine: the test program reconnects,
> and messages continue to be delivered in order.  If I run this long enough, though, eventually
> my test program fails just after a slave broker is promoted to master, with output similar
> to this:
>
> Mar 6, 2009 11:58:12 AM org.apache.activemq.transport.failover.FailoverTransport doReconnect
> INFO: Successfully reconnected to tcp://localhost:61616
> Mar 6, 2009 11:58:12 AM org.aaron.MasterSlaveTest$Receiver onMessage
> WARNING: test.queue.3 received 630 expected 629
>
> This indicates the receiver for test.queue.3 received message 630 after the slave broker
> took over and missed message 629.
>
> This seems to happen more often when more senders and receivers are running and more
> queues are in use.  If I run a single sender/receiver pair on 1 queue, it is very difficult
> to make this happen.
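
The receiver-side check described above (track the next expected sequence number per queue and compare each incoming number against it) can be sketched roughly as follows. This is a minimal illustration, not the code from the attached MasterSlaveTest.java: the class name SequenceChecker, the expect() helper, the GAP/REPLAY labels, and the assumption that numbering starts at 0 are all hypothetical. It also separates the two kinds of mismatch seen in this thread: a received number higher than expected (630 vs. 629 in the original report) and a received number lower than expected (520 vs. 2712 and 1049 vs. 3205 in the two runs in this comment). A lower number would be consistent with an earlier message being delivered again, though the logs alone do not confirm that.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not taken from MasterSlaveTest.java. */
public class SequenceChecker {

    /** Outcome of comparing a received sequence number with the expected one. */
    public enum Result { IN_ORDER, GAP, REPLAY }

    // Next expected sequence number, keyed by queue name.
    private final Map<String, Long> nextExpected = new HashMap<>();

    /** Sets the next expected sequence number for a queue (assumed helper). */
    public void expect(String queue, long next) {
        nextExpected.put(queue, next);
    }

    /**
     * Compares a received sequence number with the expected one.
     * Advances the expectation only when the message is in order.
     * Assumes numbering starts at 0 for queues not seen before.
     */
    public Result check(String queue, long received) {
        long expected = nextExpected.getOrDefault(queue, 0L);
        if (received == expected) {
            nextExpected.put(queue, expected + 1);
            return Result.IN_ORDER;
        }
        // Higher than expected: at least one message was skipped.
        // Lower than expected: an already-seen number arrived again.
        return received > expected ? Result.GAP : Result.REPLAY;
    }

    public static void main(String[] args) {
        SequenceChecker checker = new SequenceChecker();

        // Numbers from the original report: received 630, expected 629.
        checker.expect("test.queue.3", 629);
        System.out.println(checker.check("test.queue.3", 630)); // GAP

        // Numbers from run 1 in this comment: received 520, expected 2712.
        checker.expect("test.queue.8", 2712);
        System.out.println(checker.check("test.queue.8", 520)); // REPLAY
    }
}
```

In a real receiver, check() would be called from the JMS onMessage callback with the value of the sequence number header; the attached test program exits on any mismatch, whereas this sketch only reports which kind it was.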

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

