activemq-users mailing list archives

From Josh Carlson <jcarl...@e-dialog.com>
Subject Re: Messages stuck after Client host reboot
Date Wed, 14 Apr 2010 21:32:53 GMT
Folks ... just because I hate nothing more than coming across a post 
without a solution, I thought I'd post what I did. After discovering 
the same problem on Solaris as on Linux, I decided that TCP keepalive 
might be the answer.

ActiveMQ does appear to allow you to set this:

       http://activemq.apache.org/tcp-transport-reference.html

However, my attempt using STOMP did not work:

<transportConnector name="stomp" uri="stomp://mmq1:61613?keepAlive"/>

A strace of the JVM shows that the socket option never gets set. AMQ 
devs, should that have worked?
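
For completeness: keepAlive is documented as a boolean option, so it may 
need an explicit value, and per-connection options on a broker-side 
transportConnector sometimes need the transport. prefix. Two untested 
variants that might be worth a try:

<transportConnector name="stomp" uri="stomp://mmq1:61613?keepAlive=true"/>
<transportConnector name="stomp" uri="stomp://mmq1:61613?transport.keepAlive=true"/>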

Anyway, so I decided to use LD_PRELOAD to enable TCP keepalive. I 
downloaded this project:

     http://libkeepalive.sourceforge.net/

I changed it to interpose accept() as well, and it worked like a charm. 
The message gets re-dispatched according to whatever keepalive parameters 
I have set. Lovely. I've submitted my changes to the libkeepalive project 
owner.
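
For anyone who wants to roll their own rather than patch libkeepalive, 
the whole trick is an LD_PRELOAD shim that wraps accept() and switches on 
SO_KEEPALIVE for every accepted socket. A minimal sketch (illustrative 
only, not the actual patch I sent upstream):

/* keepalive_accept.c -- force SO_KEEPALIVE on every accept()ed socket.
 * build: gcc -fPIC -shared -o keepalive_accept.so keepalive_accept.c -ldl
 * use:   LD_PRELOAD=./keepalive_accept.so <broker start command>
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/socket.h>

typedef int (*accept_fn)(int, struct sockaddr *, socklen_t *);

int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen)
{
    static accept_fn real_accept;

    if (!real_accept)
        real_accept = (accept_fn) dlsym(RTLD_NEXT, "accept");

    int fd = real_accept(sockfd, addr, addrlen);
    if (fd >= 0) {
        int on = 1;
        /* turn on keepalive probes; if this fails the connection
           still works, it just won't be probed */
        setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
    }
    return fd;
}

The probe timing (idle time, interval, probe count) then comes from the 
kernel defaults (net.ipv4.tcp_keepalive_*) unless the per-socket 
TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT options are also set, which is what 
libkeepalive's environment overrides do.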

Cheers,

Josh

On 04/14/2010 11:58 AM, Josh Carlson wrote:
> Hi Dejan,
>
> I don't think it would be practical or correct for us to do that 
> client side. The thing that gets me, though, is that killing the client 
> *process* causes the tcp connection to get closed on the other end, 
> but killing the client *host* keeps the tcp connection established on 
> the other end. Isn't that a kernel bug? Shouldn't it behave the same way 
> in both circumstances?
>
> Cheers
>
> Josh
>
> On 04/14/2010 11:48 AM, Dejan Bosanac wrote:
>> Hi Josh,
>>
>> that's the job of the inactivity monitor when using OpenWire. 
>> Unfortunately Stomp doesn't support that in version 1.0; it is 
>> something we want to add in the next version of the spec. Maybe 
>> implementing something like that at the application level would help 
>> in your case?
>>
>> Cheers
>> --
>> Dejan Bosanac - http://twitter.com/dejanb
>>
>> Open Source Integration - http://fusesource.com/
>> ActiveMQ in Action - http://www.manning.com/snyder/
>> Blog - http://www.nighttale.net
>>
>>
>> On Wed, Apr 14, 2010 at 5:41 PM, Josh Carlson <jcarlson@e-dialog.com> wrote:
>>
>>     Hmm. If a timeout were the solution to this problem, how would you
>>     be able to tell the difference between something being wrong and
>>     the client just being slow?
>>
>>     I did an strace on the server and discovered how the timeout is
>>     being used: as a parameter to poll():
>>
>>     6805  10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
>>     6805  10:31:15 <... poll resumed> )     = 1 ([{fd=94, revents=POLLIN}])
>>     6805  10:31:15 recvfrom(94, "CONNECT\npasscode:...."..., 8192, 0, NULL, NULL) = 39
>>     6805  10:31:15 sendto(94, "CONNECTED\nsession:ID:mmq1-40144-"..., 53, 0, NULL, 0) = 53
>>     6805  10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000) = 1 ([{fd=94, revents=POLLIN}])
>>     6805  10:31:15 recvfrom(94, "SUBSCRIBE\nactivemq.prefetchSize:"..., 8192, 0, NULL, NULL) = 138
>>     6805  10:31:15 sendto(94, "RECEIPT\nreceipt-id:39ef0e071a549"..., 55, 0, NULL, 0) = 55
>>     6805  10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
>>     6805  10:32:15 <... poll resumed> )     = 0 (Timeout)
>>     6805  10:32:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
>>     6805  10:33:15 <... poll resumed> )     = 0 (Timeout)
>>     6805  10:33:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
>>     6805  10:34:15 <... poll resumed> )     = 0 (Timeout)
>>
>>     In the output above I stripped lines that were not operations
>>     directly on the socket. The poll Timeouts continue on ... with
>>     nothing in between.
>>
>>     [root@mmq1 tmp]# lsof -p 6755 | grep mmq1
>>     java    6755 root   85u  IPv6  1036912  TCP mmq1.eng.e-dialog.com:61613 (LISTEN)
>>     java    6755 root   92u  IPv6  1038039  TCP mmq1.eng.e-dialog.com:61613->10.0.13.230:46542 (ESTABLISHED)
>>     java    6755 root   94u  IPv6  1036997  TCP mmq1.eng.e-dialog.com:61613->mmd2.eng.e-dialog.com:41743 (ESTABLISHED)
>>
>>     The connection to mmd2 is to the host that is gone. The one to
>>     10.0.13.230 is up and active. When I kill -9 the process on
>>     10.0.13.230 I see this in the logs:
>>
>>     2010-04-13 17:13:55,322 | DEBUG | Transport failed: java.io.EOFException | org.apache.activemq.broker.TransportConnection.Transport | ActiveMQ Transport: tcp:///10.0.13.230:45463
>>     java.io.EOFException
>>            at java.io.DataInputStream.readByte(Unknown Source)
>>            at org.apache.activemq.transport.stomp.StompWireFormat.readLine(StompWireFormat.java:186)
>>            at org.apache.activemq.transport.stomp.StompWireFormat.unmarshal(StompWireFormat.java:94)
>>            at org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:211)
>>            at org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:203)
>>            at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:186)
>>            at java.lang.Thread.run(Unknown Source)
>>     2010-04-13 17:13:55,325 | DEBUG | Stopping connection: /10.0.13.230:45463 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
>>     2010-04-13 17:13:55,325 | DEBUG | Stopping transport tcp:///10.0.13.230:45463 | org.apache.activemq.transport.tcp.TcpTransport | ActiveMQ Task
>>     2010-04-13 17:13:55,326 | DEBUG | Stopped transport: /10.0.13.230:45463 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
>>     2010-04-13 17:13:55,327 | DEBUG | Cleaning up connection resources: /10.0.13.230:45463 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
>>     2010-04-13 17:13:55,327 | DEBUG | remove connection id: ID:mmq1-58415-1271193024658-2:3 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
>>     2010-04-13 17:13:55,328 | DEBUG | masterb1 removing consumer: ID:mmq1-58415-1271193024658-2:3:-1:1 for destination: queue://Producer/TESTING/weight/three/ | org.apache.activemq.broker.region.AbstractRegion | ActiveMQ Task
>>
>>     Which is what I want to happen when the host goes down.
>>
>>     It seems to me that something should be noticing that the
>>     connection is really gone. Maybe this is more of a kernel issue.
>>     I would think that when the poll is done it would trigger
>>     the connection to move out of the ESTABLISHED state and get closed.
>>
>>     We are using Linux, kernel version 2.6.18, but I've seen this
>>     same issue on a range of different 2.6 versions.
>>
>>     -Josh
>>
>>
>>
>>     On 04/14/2010 09:38 AM, Josh Carlson wrote:
>>
>>         Thanks Gary for the, as usual, helpful information.
>>
>>         It looks like the broker may be suffering from exactly the same
>>         problem we encountered when implementing client-side failover.
>>         Namely, when the master broker went down, a subsequent read on
>>         the socket by the client would hang (well, actually take a very
>>         long time to fail/timeout). In that case our TCP connection was
>>         ESTABLISHED, and looking at the broker I see the same thing
>>         after the client host goes away (the connection is ESTABLISHED).
>>         We fixed this issue in our client by setting the socket option
>>         SO_RCVTIMEO on the connection to the broker.
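>>
>>         For reference, that client-side fix boils down to a setsockopt()
>>         call; a sketch (the helper name and the 60-second value below
>>         are just illustrative):
>>
>>         #include <sys/socket.h>
>>         #include <sys/time.h>
>>
>>         /* make a blocked recv()/read() on fd give up after 'seconds'
>>          * of silence instead of hanging indefinitely */
>>         static int set_recv_timeout(int fd, int seconds)
>>         {
>>             struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };
>>             return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
>>         }
>>
>>         e.g. set_recv_timeout(broker_fd, 60) right after connect().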
>>
>>         I noted that the broker appears to do the same thing with the
>>         TCP transport option soTimeout. It looks like when this is set,
>>         it winds up as a call to java.net.Socket.setSoTimeout when the
>>         socket is getting initialized. I have not done any socket
>>         programming in Java, but my assumption is that SO_TIMEOUT maps
>>         to both SO_RCVTIMEO and SO_SNDTIMEO in the C world.
>>
>>         I was hopeful about this option, but when I set it in my
>>         transport connector:
>>
>>         <transportConnector name="stomp"
>>         uri="stomp://mmq1:61613?soTimeout=60000"/>
>>
>>         the timeout does not occur. I actually ran my test case about
>>         15 hours ago, and I can see that the broker still has an
>>         ESTABLISHED connection to the dead client and has a message
>>         dispatched to it.
>>
>>         Am I misunderstanding what soTimeout is for? I can see in
>>         org.apache.activemq.transport.tcp.TcpTransport.initialiseSocket
>>         that setSoTimeout is getting called unconditionally. So what I'm
>>         wondering is whether it is actually being called with a 0 value
>>         despite the way I set up my transport connector. I suppose a
>>         value of 0 would explain why it apparently never times out,
>>         whereas in our client case it eventually did time out (because
>>         we were not setting the option at all before).
>>
>>
>>
>>
>>         On 04/14/2010 05:15 AM, Gary Tully wrote:
>>
>>             The re-dispatch is triggered by the tcp connection dying;
>>             netstat can help with the diagnosis here. Check the
>>             connection state of the broker port after the client host is
>>             rebooted. If the connection is still active, possibly in a
>>             TIME_WAIT state, you may need to configure some additional
>>             timeout options on the broker side.
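>>
>>             For example, something like this on the broker host will
>>             show the state of the stomp connections (61613 being the
>>             stomp port in this setup):
>>
>>                 netstat -tan | grep 61613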
>>
>>             On 13 April 2010 19:43, Josh Carlson <jcarlson@e-dialog.com> wrote:
>>
>>                 I am using client acknowledgements with a prefetch
>>                 size of 1 with no message expiration policy. When a
>>                 consumer subscribes to a queue I can see that the
>>                 message gets dispatched correctly. If the process gets
>>                 killed before retrieving and acknowledging the message,
>>                 I see the message getting re-dispatched (correctly). I
>>                 expected this same behaviour if the host running the
>>                 process gets rebooted or crashes. However, after reboot
>>                 I can see that the message is stuck in the dispatched
>>                 state to the consumer that is long gone. Is there a way
>>                 that I can get messages re-dispatched when a host
>>                 hosting consumer processes gets rebooted? How does it
>>                 detect the case when a process dies (even with SIGKILL)?
>>
>>                 I did notice that if I increase my prefetch size and
>>                 enqueue another message after the reboot, ActiveMQ will
>>                 re-dispatch the original message. However, with prefetch
>>                 size equal to one the message never seems to get
>>                 re-dispatched.
>>
>>
>>
>>
>>             -- 
>>             http://blog.garytully.com
>>
>>             Open Source Integration
>>             http://fusesource.com
>>
>>
