activemq-users mailing list archives

From Josh Carlson <jcarl...@e-dialog.com>
Subject Re: Messages stuck after Client host reboot
Date Thu, 15 Apr 2010 14:25:01 GMT
Thanks! Prefixing with 'transport.' worked.
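
For the archives: the connector presumably ends up looking something like
this (assuming the same host/port as in my earlier config below):

  <transportConnector name="stomp" uri="stomp://mmq1:61613?transport.keepAlive=true"/>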

It would be helpful if we could also configure TCP_KEEPIDLE, 
TCP_KEEPCNT, and TCP_KEEPINTVL on the underlying socket; otherwise one 
has to rely on reasonable system-wide sysctl settings.
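
For anyone who lands here later: on Linux the system-wide knobs are
net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl and
net.ipv4.tcp_keepalive_probes. Per-socket tuning, which the broker does
not expose, would look roughly like this in C (a sketch only; the values
are made-up examples, not recommendations):

  /* Sketch: per-socket keepalive tuning, e.g. from an LD_PRELOAD shim
   * like the one described further down the thread. */
  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  static int tune_keepalive(int fd)
  {
      int on = 1;
      int idle = 60;    /* TCP_KEEPIDLE: idle seconds before the first probe */
      int intvl = 10;   /* TCP_KEEPINTVL: seconds between probes */
      int cnt = 5;      /* TCP_KEEPCNT: failed probes before declaring the peer dead */

      if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
          return -1;
      if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
          return -1;
      if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0)
          return -1;
      return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
  }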

Thanks for the help.

Cheers,

-Josh

On 04/15/2010 05:12 AM, Gary Tully wrote:
> Josh, thanks for completing the loop with your reply.
>
> I think the keepAlive option needs to be prefixed with transport. to 
> make it work for the accepted socket.
>
>   stomp://mmq1:61613?transport.keepAlive=true
>
> There is a bit too much going on in the transport configuration, with 
> TcpTransport options, socket. options and transport. options. It is 
> easy to get confused, but almost all options are settable in some way.
>
>
> On 14 April 2010 22:32, Josh Carlson <jcarlson@e-dialog.com> wrote:
>
>     Folks ... just because I hate nothing more than coming across a
>     post without a solution, I thought I'd post what I did. After
>     discovering the same problem on Solaris as on Linux, I decided that
>     TCP keepalive might be the answer.
>
>     ActiveMQ does appear to allow you to set this:
>
>     http://activemq.apache.org/tcp-transport-reference.html
>
>     However, my attempts using STOMP did not work:
>
>     <transportConnector name="stomp" uri="stomp://mmq1:61613?keepAlive"/>
>
>     A strace of the JVM shows that the socket option never gets set.
>     AMQ devs, should that have worked?
>
>     Anyway, so I decided to use LD_PRELOAD to enable keep alive. I
>     downloaded this project:
>
>     http://libkeepalive.sourceforge.net/
>
>     changed it to interpose accept() as well and it worked like a
>     charm. The message gets re-dispatched according to whatever
>     keepalive parameters I have set. Lovely. I've submitted my changes
>     to the libkeepalive project owner.
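>
>     In case it helps anyone searching the archives later, the interposer
>     is roughly the following (a simplified sketch, not the actual
>     libkeepalive code; it only forces SO_KEEPALIVE on accepted sockets):
>
>     /*
>      * Sketch of an LD_PRELOAD shim that wraps accept() and enables
>      * SO_KEEPALIVE on every accepted socket. Build with something like:
>      *   gcc -shared -fPIC -o libkeepaccept.so keepaccept.c -ldl
>      */
>     #define _GNU_SOURCE
>     #include <dlfcn.h>
>     #include <sys/socket.h>
>
>     int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen)
>     {
>         static int (*real_accept)(int, struct sockaddr *, socklen_t *);
>         int fd;
>         int on = 1;
>
>         if (!real_accept)   /* resolve the real accept() on first use */
>             real_accept = (int (*)(int, struct sockaddr *, socklen_t *))
>                               dlsym(RTLD_NEXT, "accept");
>
>         fd = real_accept(sockfd, addr, addrlen);
>         if (fd >= 0)
>             setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
>         return fd;
>     }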
>
>     Cheers,
>
>     Josh
>
>
>     On 04/14/2010 11:58 AM, Josh Carlson wrote:
>>     Hi Dejan,
>>
>>     I don't think it would be practical or correct for us to do that
>>     client side. The thing that gets me, though, is that killing the
>>     client *process* causes the TCP connection to get closed on the
>>     other end, but killing the client *host* leaves the TCP connection
>>     established on the other end. Isn't that a kernel bug? Shouldn't
>>     it behave the same way in both circumstances?
>>
>>     Cheers
>>
>>     Josh
>>
>>     On 04/14/2010 11:48 AM, Dejan Bosanac wrote:
>>>     Hi Josh,
>>>
>>>     that's the job of the inactivity monitor when using OpenWire.
>>>     Unfortunately Stomp doesn't support that in version 1.0; it is
>>>     something we want to add in the next version of the spec. Maybe
>>>     implementing something like that at the application level would
>>>     help in your case?
>>>
>>>     Cheers
>>>     --
>>>     Dejan Bosanac - http://twitter.com/dejanb
>>>
>>>     Open Source Integration - http://fusesource.com/
>>>     ActiveMQ in Action - http://www.manning.com/snyder/
>>>     Blog - http://www.nighttale.net
>>>
>>>
>>>     On Wed, Apr 14, 2010 at 5:41 PM, Josh Carlson
>>>     <jcarlson@e-dialog.com> wrote:
>>>
>>>         Hmm. If a timeout were the solution to this problem, how would
>>>         you be able to tell the difference between something being
>>>         wrong and the client just being slow?
>>>
>>>         I did an strace on the server and discovered how the timeout
>>>         is being used: as a parameter to poll.
>>>
>>>         6805  10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1,
>>>         60000 <unfinished ...>
>>>         6805  10:31:15 <... poll resumed> )     = 1 ([{fd=94,
>>>         revents=POLLIN}])
>>>         6805  10:31:15 recvfrom(94, "CONNECT\npasscode:...."...,
>>>         8192, 0, NULL, NULL) = 39
>>>         6805  10:31:15 sendto(94,
>>>         "CONNECTED\nsession:ID:mmq1-40144-"..., 53, 0, NULL, 0) = 53
>>>         6805  10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1,
>>>         60000) = 1 ([{fd=94, revents=POLLIN}])
>>>         6805  10:31:15 recvfrom(94,
>>>         "SUBSCRIBE\nactivemq.prefetchSize:"..., 8192, 0, NULL, NULL)
>>>         = 138
>>>         6805  10:31:15 sendto(94,
>>>         "RECEIPT\nreceipt-id:39ef0e071a549"..., 55, 0, NULL, 0) = 55
>>>         6805  10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1,
>>>         60000 <unfinished ...>
>>>         6805  10:32:15 <... poll resumed> )     = 0 (Timeout)
>>>         6805  10:32:15 poll([{fd=94, events=POLLIN|POLLERR}], 1,
>>>         60000 <unfinished ...>
>>>         6805  10:33:15 <... poll resumed> )     = 0 (Timeout)
>>>         6805  10:33:15 poll([{fd=94, events=POLLIN|POLLERR}], 1,
>>>         60000 <unfinished ...>
>>>         6805  10:34:15 <... poll resumed> )     = 0 (Timeout)
>>>
>>>         In the output above I stripped lines that were not
>>>         operations directly on the socket. The poll Timeouts
>>>         continue on ... with nothing in between.
>>>
>>>         [root@mmq1 tmp]# lsof -p 6755 | grep mmq1
>>>         java    6755 root   85u  IPv6  1036912  TCP mmq1.eng.e-dialog.com:61613 (LISTEN)
>>>         java    6755 root   92u  IPv6  1038039  TCP mmq1.eng.e-dialog.com:61613->10.0.13.230:46542 (ESTABLISHED)
>>>         java    6755 root   94u  IPv6  1036997  TCP mmq1.eng.e-dialog.com:61613->mmd2.eng.e-dialog.com:41743 (ESTABLISHED)
>>>
>>>         The connection to mmd2 is the host that is gone. The one to
>>>         10.0.13.230 is up and active. When I kill -9 the process on
>>>         10.0.13.230 I see this in the logs:
>>>
>>>         2010-04-13 17:13:55,322 | DEBUG | Transport failed:
>>>         java.io.EOFException |
>>>         org.apache.activemq.broker.TransportConnection.Transport |
>>>         ActiveMQ Transport: tcp:///10.0.13.230:45463
>>>         java.io.EOFException
>>>                at java.io.DataInputStream.readByte(Unknown Source)
>>>                at
>>>         org.apache.activemq.transport.stomp.StompWireFormat.readLine(StompWireFormat.java:186)
>>>                at
>>>         org.apache.activemq.transport.stomp.StompWireFormat.unmarshal(StompWireFormat.java:94)
>>>                at
>>>         org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:211)
>>>                at
>>>         org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:203)
>>>                at
>>>         org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:186)
>>>                at java.lang.Thread.run(Unknown Source)
>>>         2010-04-13 17:13:55,325 | DEBUG | Stopping connection:
>>>         /10.0.13.230:45463 |
>>>         org.apache.activemq.broker.TransportConnection | ActiveMQ Task
>>>         2010-04-13 17:13:55,325 | DEBUG | Stopping transport
>>>         tcp:///10.0.13.230:45463 |
>>>         org.apache.activemq.transport.tcp.TcpTransport | ActiveMQ Task
>>>         2010-04-13 17:13:55,326 | DEBUG | Stopped transport:
>>>         /10.0.13.230:45463 |
>>>         org.apache.activemq.broker.TransportConnection | ActiveMQ Task
>>>         2010-04-13 17:13:55,327 | DEBUG | Cleaning up connection
>>>         resources: /10.0.13.230:45463 |
>>>         org.apache.activemq.broker.TransportConnection | ActiveMQ Task
>>>         2010-04-13 17:13:55,327 | DEBUG | remove connection id:
>>>         ID:mmq1-58415-1271193024658-2:3 |
>>>         org.apache.activemq.broker.TransportConnection | ActiveMQ Task
>>>         2010-04-13 17:13:55,328 | DEBUG | masterb1 removing
>>>         consumer: ID:mmq1-58415-1271193024658-2:3:-1:1 for
>>>         destination: queue://Producer/TESTING/weight/three/ |
>>>         org.apache.activemq.broker.region.AbstractRegion | ActiveMQ Task
>>>
>>>         Which is what I want to happen when the host goes down.
>>>
>>>         It seems to me that something should be noticing that the
>>>         connection is really gone. Maybe this is more of a kernel
>>>         issue. I would have thought that when the poll runs it would
>>>         trigger the connection to move out of the ESTABLISHED state
>>>         and get closed.
>>>
>>>         We are using Linux, kernel version 2.6.18, but I've seen
>>>         this same issue on a range of different 2.6 versions.
>>>
>>>         -Josh
>>>
>>>
>>>
>>>         On 04/14/2010 09:38 AM, Josh Carlson wrote:
>>>
>>>             Thanks Gary for the, as usual, helpful information.
>>>
>>>             It looks like the broker may be suffering from exactly the
>>>             same problem we encountered when implementing client-side
>>>             failover: when the master broker went down, a subsequent
>>>             read on the socket by the client would hang (well, actually
>>>             take a very long time to fail/timeout). In that case our
>>>             TCP connection was ESTABLISHED, and on the broker I see the
>>>             same thing after the client host goes away (the connection
>>>             stays ESTABLISHED). We fixed the issue in our client by
>>>             setting the socket option SO_RCVTIMEO on the connection to
>>>             the broker.
>>>
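>>>             (For what it's worth, the client-side fix was along these
>>>             lines; a sketch of the idea rather than our exact code:)
>>>
>>>             /* Give a blocking socket a receive timeout so a read on a
>>>              * dead connection eventually fails with EAGAIN/EWOULDBLOCK
>>>              * instead of blocking indefinitely. */
>>>             #include <sys/socket.h>
>>>             #include <sys/time.h>
>>>
>>>             static int set_recv_timeout(int fd, long seconds)
>>>             {
>>>                 struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };
>>>                 return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv,
>>>                                   sizeof(tv));
>>>             }
>>>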
>>>             I noted that the broker appears to do the same thing with
>>>             the TCP transport option soTimeout. It looks like when this
>>>             is set it winds up as a call to java.net.Socket.setSoTimeout
>>>             when the socket is getting initialized. I have not done any
>>>             socket programming in Java, but my assumption is that
>>>             SO_TIMEOUT maps to both SO_RCVTIMEO and SO_SNDTIMEO in the C
>>>             world.
>>>
>>>             I was hopeful about this option, but when I set it in my
>>>             transport connector:
>>>
>>>             <transportConnector name="stomp" uri="stomp://mmq1:61613?soTimeout=60000"/>
>>>
>>>             the timeout does not occur. I actually ran my test case
>>>             about 15 hours ago, and I can see that the broker still has
>>>             an ESTABLISHED connection to the dead client and has a
>>>             message dispatched to it.
>>>
>>>             Am I misunderstanding what soTimeout is for? I can see in
>>>             org.apache.activemq.transport.tcp.TcpTransport.initialiseSocket
>>>             that setSoTimeout is getting called unconditionally, so what
>>>             I'm wondering is whether it is actually being called with a
>>>             value of 0 despite the way I set up my transport connector.
>>>             Setting it to 0 would explain why it apparently never times
>>>             out, whereas in our client case it eventually did time out
>>>             (because we were not setting the option at all before).
>>>
>>>
>>>
>>>
>>>             On 04/14/2010 05:15 AM, Gary Tully wrote:
>>>
>>>                 The re-dispatch is triggered by the TCP connection
>>>                 dying; netstat can help with the diagnosis here. Check
>>>                 the connection state of the broker port after the client
>>>                 host is rebooted. If the connection is still active,
>>>                 possibly in a TIME_WAIT state, you may need to configure
>>>                 some additional timeout options on the broker side.
>>>
>>>                 On 13 April 2010 19:43, Josh Carlson
>>>                 <jcarlson@e-dialog.com> wrote:
>>>
>>>                     I am using client acknowledgements with a prefetch
>>>                     size of 1 and no message expiration policy. When a
>>>                     consumer subscribes to a queue, I can see that the
>>>                     message gets dispatched correctly. If the process
>>>                     gets killed before retrieving and acknowledging the
>>>                     message, I see the message getting re-dispatched
>>>                     (correctly). I expected the same behaviour if the
>>>                     host running the process gets rebooted or crashes.
>>>                     However, after a reboot I can see that the message
>>>                     is stuck in the dispatched state to the consumer
>>>                     that is long gone. Is there a way I can get messages
>>>                     re-dispatched when a host running consumer processes
>>>                     gets rebooted? How does the broker detect the case
>>>                     where a process dies (even with SIGKILL)?
>>>
>>>                     I did notice that if I increase my prefetch size and
>>>                     enqueue another message after the reboot, ActiveMQ
>>>                     will re-dispatch the original message. However, with
>>>                     a prefetch size equal to one the message never seems
>>>                     to get re-dispatched.
>>>
>>>
>>>
>>>
>>>                 -- 
>>>                 http://blog.garytully.com
>>>
>>>                 Open Source Integration
>>>                 http://fusesource.com
>>>
>>>
>
>
>
> -- 
> http://blog.garytully.com
>
> Open Source Integration
> http://fusesource.com
