Subject: Re: Messages stuck after Client host reboot
From: Dejan Bosanac
To: jcarlson@e-dialog.com
Cc: users@activemq.apache.org, Gary Tully
Date: Thu, 15 Apr 2010 10:31:04 +0200

Hi Josh,

that config should be:

[configuration snippet stripped by the list archive]

Can you try it out and see if it works for you?

Cheers
--
Dejan Bosanac - http://twitter.com/dejanb

Open Source Integration - http://fusesource.com/
ActiveMQ in Action - http://www.manning.com/snyder/
Blog - http://www.nighttale.net

On Wed, Apr 14, 2010 at 11:32 PM, Josh Carlson wrote:

> Folks ... just because I hate nothing more than coming across a post
> without a solution, I thought I'd post what I did. After discovering the
> same problem on Solaris as on Linux, I decided that TCP keepalive might
> be the answer.
>
> ActiveMQ does appear to allow you to set this:
>
> http://activemq.apache.org/tcp-transport-reference.html
>
> However, my attempts using STOMP did not work:
>
> [transport URI stripped by the list archive]
>
> A strace of the JVM shows that the socket option never gets set. AMQ
> devs, should that have worked?
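[Editorial note: Josh's keepalive idea can be sketched outside the broker. The following is a minimal Python sketch of what a transport option like keepAlive=true is generally expected to do at the OS level: enable SO_KEEPALIVE and, on Linux, tune the probe timers. The helper name and timer values are illustrative, not from the thread.]

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Enable TCP keepalive probes on a socket (values are illustrative)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* tunables are platform-specific (Linux), so guard them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s)
# Once enabled, the kernel (not the application) probes an idle peer and
# tears the connection down if the probes go unanswered -- which is exactly
# the behaviour a rebooted client host needs.
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
s.close()
```

With keepalive active, a silently vanished peer causes the broker-side read to fail once the probe budget is exhausted, instead of the socket sitting in ESTABLISHED forever.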
> Anyway, so I decided to use LD_PRELOAD to enable keepalive. I downloaded
> this project:
>
> http://libkeepalive.sourceforge.net/
>
> changed it to interpose accept() as well, and it worked like a charm. The
> message gets re-dispatched according to whatever keepalive parameters I
> have set. Lovely. I've submitted my changes to the libkeepalive project
> owner.
>
> Cheers,
>
> Josh
>
> On 04/14/2010 11:58 AM, Josh Carlson wrote:
>
> Hi Dejan,
>
> I don't think it would be practical or correct for us to do that client
> side. The thing that gets me, though, is that killing the client *process*
> causes the TCP connection to get closed on the other end, but killing the
> client *host* keeps the TCP connection established on the other end. Isn't
> that a kernel bug? Shouldn't it behave the same way in both circumstances?
>
> Cheers
>
> Josh
>
> On 04/14/2010 11:48 AM, Dejan Bosanac wrote:
>
> Hi Josh,
>
> that's the job of the inactivity monitor when using OpenWire.
> Unfortunately Stomp doesn't support that in version 1.0; it is something
> we want to add in the next version of the spec. Maybe implementing
> something like that at the application level would help in your case?
>
> Cheers
> --
> Dejan Bosanac - http://twitter.com/dejanb
>
> Open Source Integration - http://fusesource.com/
> ActiveMQ in Action - http://www.manning.com/snyder/
> Blog - http://www.nighttale.net
>
> On Wed, Apr 14, 2010 at 5:41 PM, Josh Carlson wrote:
>
>> Hmm. If a timeout was the solution to this problem, how would you be
>> able to tell the difference between something being wrong and the client
>> just being slow?
>>
>> I did an strace on the server and discovered how the timeout is being
>> used: as a parameter to poll.
>>
>> 6805 10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
>> 6805 10:31:15 <... poll resumed> ) = 1 ([{fd=94, revents=POLLIN}])
>> 6805 10:31:15 recvfrom(94, "CONNECT\npasscode:...."..., 8192, 0, NULL, NULL) = 39
>> 6805 10:31:15 sendto(94, "CONNECTED\nsession:ID:mmq1-40144-"..., 53, 0, NULL, 0) = 53
>> 6805 10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000) = 1 ([{fd=94, revents=POLLIN}])
>> 6805 10:31:15 recvfrom(94, "SUBSCRIBE\nactivemq.prefetchSize:"..., 8192, 0, NULL, NULL) = 138
>> 6805 10:31:15 sendto(94, "RECEIPT\nreceipt-id:39ef0e071a549"..., 55, 0, NULL, 0) = 55
>> 6805 10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
>> 6805 10:32:15 <... poll resumed> ) = 0 (Timeout)
>> 6805 10:32:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
>> 6805 10:33:15 <... poll resumed> ) = 0 (Timeout)
>> 6805 10:33:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
>> 6805 10:34:15 <... poll resumed> ) = 0 (Timeout)
>>
>> In the output above I stripped lines that were not operations directly
>> on the socket. The poll timeouts continue on, with nothing in between.
>>
>> [root@mmq1 tmp]# lsof -p 6755 | grep mmq1
>> java 6755 root 85u IPv6 1036912 TCP mmq1.eng.e-dialog.com:61613 (LISTEN)
>> java 6755 root 92u IPv6 1038039 TCP mmq1.eng.e-dialog.com:61613->10.0.13.230:46542 (ESTABLISHED)
>> java 6755 root 94u IPv6 1036997 TCP mmq1.eng.e-dialog.com:61613->mmd2.eng.e-dialog.com:41743 (ESTABLISHED)
>>
>> The connection to mmd2 is to the host that is gone. The one to
>> 10.0.13.230 is up and active.
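[Editorial note: the strace loop above can be reproduced in miniature. poll() on a socket with a timeout simply returns zero events when the peer sends nothing; the kernel raises no error for a peer that has silently disappeared, which is why the broker spins on "= 0 (Timeout)" forever. A short Python sketch, with the timeout shortened to milliseconds and a socketpair standing in for the broker/client connection:]

```python
import select
import socket

# A local socketpair stands in for the broker<->client TCP connection.
a, b = socket.socketpair()
p = select.poll()
p.register(a.fileno(), select.POLLIN | select.POLLERR)

# No data pending: poll returns an empty list on timeout, just like the
# "= 0 (Timeout)" lines in the trace. A vanished peer looks identical to
# an idle one.
assert p.poll(100) == []

# Once the peer actually writes, poll reports POLLIN, matching the
# "revents=POLLIN" entries above.
b.sendall(b"CONNECT\n")
events = p.poll(1000)
assert events and events[0][1] & select.POLLIN

a.close()
b.close()
```

The point is that poll's timeout only bounds how long one wait lasts; it says nothing about connection health, so the broker dutifully re-polls forever.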
>> When I kill -9 the process on 10.0.13.230 I see this in the logs:
>>
>> 2010-04-13 17:13:55,322 | DEBUG | Transport failed: java.io.EOFException | org.apache.activemq.broker.TransportConnection.Transport | ActiveMQ Transport: tcp:///10.0.13.230:45463
>> java.io.EOFException
>>     at java.io.DataInputStream.readByte(Unknown Source)
>>     at org.apache.activemq.transport.stomp.StompWireFormat.readLine(StompWireFormat.java:186)
>>     at org.apache.activemq.transport.stomp.StompWireFormat.unmarshal(StompWireFormat.java:94)
>>     at org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:211)
>>     at org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:203)
>>     at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:186)
>>     at java.lang.Thread.run(Unknown Source)
>> 2010-04-13 17:13:55,325 | DEBUG | Stopping connection: /10.0.13.230:45463 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
>> 2010-04-13 17:13:55,325 | DEBUG | Stopping transport tcp:///10.0.13.230:45463 | org.apache.activemq.transport.tcp.TcpTransport | ActiveMQ Task
>> 2010-04-13 17:13:55,326 | DEBUG | Stopped transport: /10.0.13.230:45463 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
>> 2010-04-13 17:13:55,327 | DEBUG | Cleaning up connection resources: /10.0.13.230:45463 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
>> 2010-04-13 17:13:55,327 | DEBUG | remove connection id: ID:mmq1-58415-1271193024658-2:3 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
>> 2010-04-13 17:13:55,328 | DEBUG | masterb1 removing consumer: ID:mmq1-58415-1271193024658-2:3:-1:1 for destination: queue://Producer/TESTING/weight/three/ | org.apache.activemq.broker.region.AbstractRegion | ActiveMQ Task
>>
>> Which is what I want to happen when the host goes down.
>>
>> It seems to me that something should be noticing that the connection is
>> really gone.
>> Maybe this is more of a kernel issue. I would think that when the poll
>> is done it would trigger the connection to move out of the ESTABLISHED
>> state and get closed.
>>
>> We are using Linux, kernel version 2.6.18, but I've seen this same issue
>> on a range of different 2.6 versions.
>>
>> -Josh
>>
>> On 04/14/2010 09:38 AM, Josh Carlson wrote:
>>
>>> Thanks Gary for the, as usual, helpful information.
>>>
>>> It looks like the broker may be suffering from exactly the same problem
>>> we encountered when implementing client-side failover. Namely, when the
>>> master broker went down, a subsequent read on the socket by the client
>>> would hang (well, actually take a very long time to fail/timeout). In
>>> that case our TCP connection was ESTABLISHED, and looking at the broker
>>> I see the same thing after the client host goes away (the connection is
>>> ESTABLISHED). We fixed this issue in our client by setting the socket
>>> option SO_RCVTIMEO on the connection to the broker.
>>>
>>> I noted that the broker appears to do the same thing with the TCP
>>> transport option soTimeout. It looks like when this is set it winds up
>>> as a call to java.net.Socket.setSoTimeout when the socket is getting
>>> initialized. I have not done any socket programming in Java, but my
>>> assumption is that SO_TIMEOUT maps to both SO_RCVTIMEO and SO_SNDTIMEO
>>> in the C world.
>>>
>>> I was hopeful with this option, but when I set it in my transport
>>> connector:
>>>
>>> <transportConnector uri="stomp://mmq1:61613?soTimeout=60000"/>
>>>
>>> the timeout does not occur. I actually ran my test case about 15 hours
>>> ago and I can still see that the broker still has an ESTABLISHED
>>> connection to the dead client and has a message dispatched to it.
>>>
>>> Am I misunderstanding what soTimeout is for? I can see in
>>> org.apache.activemq.transport.tcp.TcpTransport.initialiseSocket that
>>> setSoTimeout is getting called unconditionally.
>>> So what I'm wondering is whether it is actually calling it with a 0
>>> value despite the way I set up my transport connector. I suppose
>>> setting this to 0 would explain why it apparently never times out,
>>> whereas in our client case it eventually did time out (because we were
>>> not setting the option at all before).
>>>
>>> On 04/14/2010 05:15 AM, Gary Tully wrote:
>>>
>>>> The re-dispatch is triggered by the tcp connection dying; netstat can
>>>> help with the diagnosis here. Check the connection state of the broker
>>>> port after the client host is rebooted. If the connection is still
>>>> active, possibly in a TIME_WAIT state, you may need to configure some
>>>> additional timeout options on the broker side.
>>>>
>>>> On 13 April 2010 19:43, Josh Carlson wrote:
>>>>
>>>> I am using client acknowledgements with a prefetch size of 1, with no
>>>> message expiration policy. When a consumer subscribes to a queue I can
>>>> see that the message gets dispatched correctly. If the process gets
>>>> killed before retrieving and acknowledging the message, I see the
>>>> message getting re-dispatched (correctly). I expected the same
>>>> behaviour if the host running the process gets rebooted or crashes.
>>>> However, after a reboot I can see that the message is stuck in the
>>>> dispatched state to the consumer that is long gone. Is there a way
>>>> that I can get messages re-dispatched when a host hosting consumer
>>>> processes gets rebooted? How does it detect the case when a process
>>>> dies (even with SIGKILL)?
>>>>
>>>> I did notice that if I increase my prefetch size and enqueue another
>>>> message after the reboot, activemq will re-dispatch the original
>>>> message. However, with prefetch size equal to one the message never
>>>> seems to get re-dispatched.
>>>>
>>>> --
>>>> http://blog.garytully.com
>>>>
>>>> Open Source Integration
>>>> http://fusesource.com
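[Editorial note: the client-side fix Josh describes earlier in the thread, setting SO_RCVTIMEO, can be sketched in Python, where settimeout gives the same observable behaviour: a read that would otherwise block forever on a dead-but-ESTABLISHED connection fails once the timeout elapses. Incidentally, Java's Socket.setSoTimeout affects only read operations, i.e. it is the analogue of SO_RCVTIMEO, not SO_SNDTIMEO. The 200 ms value below is for illustration only, not from the thread.]

```python
import socket

# A local socketpair stands in for the client's connection to the broker;
# the "broker" end (b) deliberately never writes, mimicking a peer that
# has silently gone away.
a, b = socket.socketpair()
a.settimeout(0.2)  # receive timeout: roughly what SO_RCVTIMEO provides in C

try:
    a.recv(8192)       # no data will ever arrive
    timed_out = False
except socket.timeout:
    # Instead of hanging indefinitely, the read fails after 200 ms and the
    # application can tear the connection down and fail over.
    timed_out = True

assert timed_out
a.close()
b.close()
```

This is the behaviour Josh expected from the broker's soTimeout option; whether the configured value actually reaches setSoTimeout is exactly the open question in his message above.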