activemq-dev mailing list archives

From "Josh Carlson (JIRA)" <j...@apache.org>
Subject [jira] Created: (AMQ-2627) Failover causes duplicate messages
Date Wed, 24 Feb 2010 19:00:40 GMT
Failover causes duplicate messages
----------------------------------

                 Key: AMQ-2627
                 URL: https://issues.apache.org/activemq/browse/AMQ-2627
             Project: ActiveMQ
          Issue Type: Bug
          Components: Broker
    Affects Versions: 5.3.0
         Environment: Server: 2 RHEL 5.3 x86-64 machines. Kernel version 2.6.18-128.0.0.0.2.el5.
Client: Same as above. Also tested with same results on Fedora Core 11
            Reporter: Josh Carlson
            Priority: Blocker


When using a shared file system master/slave ActiveMQ configuration with client
acknowledgements, we run into a problem when our clients fail over to a new server.
The new server does not appear to have any knowledge of pending messages that the
old server had dispatched to clients. Consequently, all of these pending messages
get dispatched a second time, even though the clients had acknowledged them.
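
The suspected behaviour can be sketched with a toy model. This is not ActiveMQ
code: ToyBroker, SharedStore and simulateFailover are hypothetical names, and the
sketch simply assumes that the list of dispatched-but-unacknowledged messages lives
only in the memory of the broker that dispatched them, while the messages themselves
sit in the shared store.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <set>
#include <string>
#include <vector>

// Persistent queue contents on the shared file system; survives a broker crash.
struct SharedStore {
    std::deque<std::string> queued;  // kept until acknowledged
};

// Toy broker (hypothetical): dispatch bookkeeping is memory-only.
class ToyBroker {
public:
    explicit ToyBroker(SharedStore& store) : store_(store) {}

    // Hand out the oldest stored message this broker has not dispatched yet.
    bool dispatch(std::string& id) {
        for (const std::string& candidate : store_.queued) {
            if (pending_.count(candidate) == 0) {
                pending_.insert(candidate);
                id = candidate;
                return true;
            }
        }
        return false;
    }

    // Returns false for an unmatched ack: this broker never dispatched the id.
    bool acknowledge(const std::string& id) {
        if (pending_.erase(id) == 0) {
            return false;
        }
        store_.queued.erase(
            std::remove(store_.queued.begin(), store_.queued.end(), id),
            store_.queued.end());
        return true;
    }

private:
    SharedStore& store_;
    std::set<std::string> pending_;  // lost when the broker process dies
};

struct ScenarioResult {
    std::vector<std::string> deliveries;  // what the consumer sees, in order
    bool firstAckMatched;                 // result of the ack sent after failover
};

// Two messages on the queue, master killed between receive and acknowledge.
ScenarioResult simulateFailover() {
    SharedStore store;
    store.queued.push_back("msg-1");
    store.queued.push_back("msg-2");

    ScenarioResult result;
    std::string id;
    {
        ToyBroker master(store);
        bool dispatched = master.dispatch(id);
        assert(dispatched);
        result.deliveries.push_back(id);
    }  // master is killed here; its pending set is gone, the store is not

    ToyBroker newMaster(store);
    result.firstAckMatched = newMaster.acknowledge(id);  // ack arrives at new master
    while (newMaster.dispatch(id)) {
        result.deliveries.push_back(id);
        newMaster.acknowledge(id);
    }
    return result;
}
```

In this model the consumer ends up with three deliveries for two messages and the
first acknowledgement is unmatched. The order in which the duplicate shows up is an
artifact of the model, not a claim about the real broker.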

Please confirm my suspicion that this is a server-side bug, and let me know of any
suggestions for working around it. I have put this at Priority 'Blocker' because it
blocks our progress towards deploying an ActiveMQ solution to our infrastructure.
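
One possible client-side stopgap, sketched here only as an illustration, is to drop
redeliveries by remembering recently seen message IDs. DuplicateFilter is a
hypothetical helper, not part of activemq-cpp; it assumes that a redelivered message
carries the same ID as the original (for example the value returned by
cms::Message::getCMSMessageID()) and that duplicates arrive within the remembered
window.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_set>

// Hypothetical helper: remembers the last `capacity` message IDs.
class DuplicateFilter {
public:
    explicit DuplicateFilter(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Returns true the first time an ID is seen, false while it is still remembered.
    bool firstTime(const std::string& messageId) {
        if (seen_.count(messageId) != 0) {
            return false;
        }
        if (order_.size() == capacity_) {
            // Forget the oldest ID to keep memory bounded.
            seen_.erase(order_.front());
            order_.pop_front();
        }
        order_.push_back(messageId);
        seen_.insert(messageId);
        return true;
    }

private:
    std::size_t capacity_;
    std::deque<std::string> order_;         // insertion order, oldest first
    std::unordered_set<std::string> seen_;  // fast membership test
};
```

A consumer would call firstTime() with the message ID before processing and skip
the message when it returns false. This only hides the duplicates from the
application; it does not address the unmatched acknowledgement on the broker.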


If you look at the log file from the new broker you can see that the acks for those
messages do not get matched:

   2010-02-24 12:46:49,759 | WARN  | Async error occurred: javax.jms.JMSException: Unmatched acknowledege:

I do not know whether this gets bubbled up to the client or not. If it does, it must
be handled under the hood in activemq-cpp, because from the application layer I do
not see any errors. In our in-house Perl Stomp client we wind up getting an ERROR
frame, which the client did not know what to do with. This is where I initially ran
into this problem. Today is my first day using CMS, in an attempt to verify that the
bug is independent of the client and to provide a reproducer using a client everyone
should have ready access to.

The attached tar file contains the following files for reproducing this problem.

Contents:

   README.txt                   - This file
   activemq_1.xml               - ActiveMQ config for the server that was master at the
                                  time I started the consumer
   activemq_2.xml               - ActiveMQ config for the broker which became the master
                                  after the original master failed
   activemq_1.log               - Log file from the first server
   activemq_2.log               - Log file from the second server
   producers/SimpleProducer.cpp - Modified version of the program shipped in
                                  activemq-cpp-library-3.1.0 to send only 2 messages and
                                  take two broker hosts on the command line
   consumers/SimpleConsumer.cpp - New file, but really just a modified version of
                                  SimpleAsyncConsumer shipped with
                                  activemq-cpp-library-3.1.0. Modified as follows:
                                     - Retrieves messages synchronously and in one
                                       thread (so we can see what is going on)
                                     - Takes two command line options to name the
                                       broker hosts to use in the broker URI
                                     - Uses client acknowledgements
                                     - After retrieving a message it blocks waiting for
                                       standard input (so one has time to go kill the
                                       server)
   Makefile.am                  - Modified version of the makefile to build the new
                                  SimpleConsumer program
    
    
Note that these files must be built from inside an activemq-cpp build tree. So the
first step to reproduce this problem is to copy producers/SimpleProducer.cpp,
consumers/SimpleConsumer.cpp and Makefile.am to your src/examples directory. Then
run a top-level configure and make. I ran this using activemq-cpp-library version 3.1.0.
    
This reproducer expects that you have only 2 ActiveMQ brokers and that they are
configured using a shared file system master/slave configuration. It also expects an
OpenWire transport connector listening on port 61616 on those two machines. (Note:
my ActiveMQ configs use the transport URI uri="tcp://q1masterhost:61616";
q1masterhost resolves to the ethernet 0 interface on each of the hosts.)
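
For reference, a broker URI covering both hosts can be assembled with the ActiveMQ
failover transport syntax, failover:(uri1,uri2). The helper below is a minimal
sketch; buildFailoverUri is a hypothetical name, and the attached programs may build
their URI differently or append further options.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: joins broker hosts into a failover URI,
// one tcp:// entry per host, all on the same port.
std::string buildFailoverUri(const std::vector<std::string>& hosts, int port = 61616) {
    assert(!hosts.empty());
    std::string uri = "failover:(";
    for (std::size_t i = 0; i < hosts.size(); ++i) {
        if (i != 0) {
            uri += ',';
        }
        uri += "tcp://" + hosts[i] + ":" + std::to_string(port);
    }
    uri += ')';
    return uri;
}
```

With the two hosts used in the transcripts, mmq1 and mmq2, this yields
failover:(tcp://mmq1:61616,tcp://mmq2:61616).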

Once you have those two brokers set up and running, go ahead and run the
simple_producer program, passing the hostnames of your two brokers on the command line:

        [jcarlson@rocky examples]$ ./simple_producer mmq1 mmq2
        =====================================================
        Starting the example:
        -----------------------------------------------------
        Sent message #1 from thread 139817389041504
        Sent message #2 from thread 139817389041504
        -----------------------------------------------------
        Finished with the example.
        =====================================================

Now do the same for the simple_consumer:

        [jcarlson@rocky examples]$ ./simple_consumer mmq1 mmq2
        =====================================================
        Starting the example:
        -----------------------------------------------------
        Message #1 Received: Hello world! from thread 139817389041504
        Waiting for stdin to acknoledge

The app has retrieved one message but has not ack'ed it yet. Now go identify
which host has the master broker and kill the process. The master broker will
be the one which is *not* printing 'Database [lockfile] is locked' messages.

In my case the broker was on mmq1 so I did this in another terminal:

        ssh -t mmq1 sudo pkill java

Immediately I see this in the console I started the consumer in:

  The Connection's Transport has been Interrupted.

and then a few seconds later I see:

  The Connection's Transport has been Restored.

At this point I hit enter in the terminal so that the message I received from
the other broker gets acknowledged and the consumer tries to get another message:

  Message #2 Received: Hello world! from thread 139817389041504
  Waiting for stdin to acknoledge

At this point, since I have only put two messages on the queue, I don't expect
any more. So when I hit enter and go back to get another message, I expect the
consumer to just sit and wait for another message to come in. This is not what
happens. A third message is retrieved:

  Message #3 Received: Hello world! from thread 139817389041504
  Waiting for stdin to acknoledge

At this point, when I hit enter again, the app blocks and I kill it with Ctrl-C.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

