Date: Mon, 9 Mar 2009 13:49:39 -0700 (PDT)
From: "Dave Stanley (JIRA)"
To: dev@activemq.apache.org
Reply-To: dev@activemq.apache.org
Subject: [jira] Commented: (AMQ-2149) Shared Filesystem Master Slave: missing messages
Message-ID: <970970334.1236631779971.JavaMail.jira@brutus>
In-Reply-To: <1847668314.1236362684677.JavaMail.jira@brutus>

[ https://issues.apache.org/activemq/browse/AMQ-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=50373#action_50373 ]

Dave Stanley commented on AMQ-2149:
-----------------------------------

It seems what's happening is that the default prefetch of 1000 is being used. When the active master is killed, there are unacknowledged in-flight messages on the broker->consumer connection. At that point all bets are off in terms of message order, because order is not guaranteed when those messages are put back on the queue.

If you set the prefetch to 1, everything looks like it works correctly, for example:

failover:(tcp://localhost:61616)?maxReconnectDelay=1000&jms.prefetchPolicy.queuePrefetch=1&useExponentialBackOff=false

I think if you added some buffering (vs the System.exit()) to the test to handle out-of-order messages, the consumer would get the messages, albeit out of order.

Regards
/Dave

> Shared Filesystem Master Slave: missing messages
> ------------------------------------------------
>
>                 Key: AMQ-2149
>                 URL: https://issues.apache.org/activemq/browse/AMQ-2149
>             Project: ActiveMQ
>          Issue Type: Bug
>    Affects Versions: 5.2.0
>         Environment: Ubuntu Linux 8.10 AMD64, Sun JDK 1.6.0.10
>            Reporter: Aaron Riekenberg
>         Attachments: activemq.log, activemq.xml, MasterSlaveTest.java, MasterSlaveTestWithTransactions.java, run_master_slave_brokers.sh
>
>
> I'm finding occasionally messages are not delivered in order in a shared filesystem master slave setup when the master fails and the slave takes over. I'm running a simple test on one physical machine where the shared filesystem is on a single disk (no SAN currently involved).
> I'm attaching a shell script (run_master_slave_brokers.sh) that starts a master and slave broker in the same directory, sleeps 20 seconds, kills the master, sleeps 20 seconds, starts a new slave, sleeps 20 seconds, kills the master, etc.
> Also attached is a small Java test program (MasterSlaveTest.java). The program starts 10 JMS senders that send 75kb text messages every 25 ms to unique queues. These messages contain a sequence number header (a long).
> The program also starts 10 receivers (1 for each queue) that keep track of the next expected sequence number and validate each incoming sequence number. If a receiver gets an unexpected sequence number, the test program exits (System.exit(1)). Both the senders and receivers use the failover transport to connect to the broker. Messages being sent are persistent, so in theory there should be no message loss when the master fails and slave takes over.
> I run the script to start the brokers, then run my test program. Most times when the script kills the master and the slave is promoted, things work fine - the test program reconnects, and messages continue to be delivered in order. If I run this long enough though, eventually my test program fails just after a slave broker is promoted to master with output similar to this:
> Mar 6, 2009 11:58:12 AM org.apache.activemq.transport.failover.FailoverTransport doReconnect
> INFO: Successfully reconnected to tcp://localhost:61616
> Mar 6, 2009 11:58:12 AM org.aaron.MasterSlaveTest$Receiver onMessage
> WARNING: test.queue.3 received 630 expected 629
> This indicates the receiver for test.queue.3 received message 630 after the slave broker took over and missed message 629.
> This seems to happen more often when more senders and receivers are running and more queues are in use. If I run a single sender/receiver pair on 1 queue, it is very difficult to make this happen.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
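
The buffering suggested in the comment (in place of the System.exit(1) in the receiver) could look roughly like the sketch below. It is only an illustration: ReorderBuffer and its methods are hypothetical names, not part of ActiveMQ or of the attached MasterSlaveTest.java. It assumes each queue's sequence numbers increase by exactly 1 and that a sequence number lower than the next expected one is a redelivered duplicate that can be ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/**
 * Hypothetical per-queue reorder buffer (not an ActiveMQ class).
 * Holds sequence numbers that arrive early and releases them once the
 * gap before them has been filled.
 */
class ReorderBuffer {
    // Early arrivals, kept sorted so the lowest is always checked first.
    private final TreeSet<Long> pending = new TreeSet<>();
    private long nextExpected;

    ReorderBuffer(long firstExpected) {
        this.nextExpected = firstExpected;
    }

    /**
     * Accepts one received sequence number and returns the sequence
     * numbers that can now be processed in order (possibly none).
     */
    List<Long> offer(long seq) {
        List<Long> ready = new ArrayList<>();
        if (seq < nextExpected) {
            // Assumption: already processed, so this is a redelivery.
            return ready;
        }
        pending.add(seq);
        // Drain every consecutive sequence number starting at nextExpected.
        while (!pending.isEmpty() && pending.first() == nextExpected) {
            ready.add(pending.pollFirst());
            nextExpected++;
        }
        return ready;
    }

    /** Number of early arrivals still waiting for a missing predecessor. */
    int pendingCount() {
        return pending.size();
    }
}
```

With the log above, a receiver expecting 629 that gets 630 first would hold it (offer(630) returns nothing), and a later offer(629) would release 629 and 630 together. A pendingCount() that never drains after failover would point to real message loss rather than reordering.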