Subject: Re: Messages stuck after Client host reboot
From: Dejan Bosanac
To: jcarlson@e-dialog.com
Cc: users@activemq.apache.org, Gary Tully
Date: Thu, 15 Apr 2010 10:31:04 +0200

Hi Josh,

that config should be:

[configuration snippet stripped by the list archive]

Can you try it out and see if it works for you?

Cheers
--
Dejan Bosanac - http://twitter.com/dejanb

Open Source Integration - http://fusesource.com/
ActiveMQ in Action - http://www.manning.com/snyder/
Blog - http://www.nighttale.net

On Wed, Apr 14, 2010 at 11:32 PM, Josh Carlson wrote:

> Folks ... just because I hate nothing more than coming across a post
> without a solution, I thought I'd post what I did. After discovering the
> same problem on Solaris as on Linux, I decided that TCP keepalive might
> be the answer.
>
> ActiveMQ does appear to allow you to set this:
>
> http://activemq.apache.org/tcp-transport-reference.html
>
> However, my attempts using STOMP did not work:
>
> [transport URI stripped by the list archive]
>
> A strace of the JVM shows that the socket option never gets set. AMQ
> devs, should that have worked?
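[Editorial note: Josh's keepalive idea can be sketched outside the broker. The following is a minimal Python sketch of what a transport option like keepAlive=true is generally expected to do at the OS level: enable SO_KEEPALIVE and, on Linux, tune the probe timers. The helper name and timer values are illustrative, not from the thread.]

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Enable TCP keepalive probes on a socket (values are illustrative)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* tunables are platform-specific (Linux), so guard them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s)
# Once enabled, the kernel (not the application) probes an idle peer and
# tears the connection down if the probes go unanswered -- which is exactly
# the behaviour a rebooted client host needs.
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
s.close()
```

With keepalive active, a silently vanished peer causes the broker-side read to fail once the probe budget is exhausted, instead of the socket sitting in ESTABLISHED forever.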
> Anyway, so I decided to use LD_PRELOAD to enable keepalive. I downloaded
> this project:
>
> http://libkeepalive.sourceforge.net/
>
> changed it to interpose accept() as well, and it worked like a charm. The
> message gets re-dispatched according to whatever keepalive parameters I
> have set. Lovely. I've submitted my changes to the libkeepalive project
> owner.
>
> Cheers,
>
> Josh
>
> On 04/14/2010 11:58 AM, Josh Carlson wrote:
>
> Hi Dejan,
>
> I don't think it would be practical or correct for us to do that client
> side. The thing that gets me, though, is that killing the client *process*
> causes the TCP connection to get closed on the other end, but killing the
> client *host* keeps the TCP connection established on the other end. Isn't
> that a kernel bug? Shouldn't it behave the same way in both circumstances?
>
> Cheers
>
> Josh
>
> On 04/14/2010 11:48 AM, Dejan Bosanac wrote:
>
> Hi Josh,
>
> that's the job of the inactivity monitor when using OpenWire.
> Unfortunately Stomp doesn't support that in version 1.0; it is something
> we want to add in the next version of the spec. Maybe implementing
> something like that at the application level would help in your case?
>
> Cheers
> --
> Dejan Bosanac - http://twitter.com/dejanb
>
> Open Source Integration - http://fusesource.com/
> ActiveMQ in Action - http://www.manning.com/snyder/
> Blog - http://www.nighttale.net
>
> On Wed, Apr 14, 2010 at 5:41 PM, Josh Carlson wrote:
>
>> Hmm. If a timeout was the solution to this problem, how would you be
>> able to tell the difference between something being wrong and the client
>> just being slow?
>>
>> I did an strace on the server and discovered how the timeout is being
>> used: as a parameter to poll.
>>
>> 6805 10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
>> 6805 10:31:15 <... poll resumed> ) = 1 ([{fd=94, revents=POLLIN}])
>> 6805 10:31:15 recvfrom(94, "CONNECT\npasscode:...."..., 8192, 0, NULL, NULL) = 39
>> 6805 10:31:15 sendto(94, "CONNECTED\nsession:ID:mmq1-40144-"..., 53, 0, NULL, 0) = 53
>> 6805 10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000) = 1 ([{fd=94, revents=POLLIN}])
>> 6805 10:31:15 recvfrom(94, "SUBSCRIBE\nactivemq.prefetchSize:"..., 8192, 0, NULL, NULL) = 138
>> 6805 10:31:15 sendto(94, "RECEIPT\nreceipt-id:39ef0e071a549"..., 55, 0, NULL, 0) = 55
>> 6805 10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
>> 6805 10:32:15 <... poll resumed> ) = 0 (Timeout)
>> 6805 10:32:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
>> 6805 10:33:15 <... poll resumed> ) = 0 (Timeout)
>> 6805 10:33:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
>> 6805 10:34:15 <... poll resumed> ) = 0 (Timeout)
>>
>> In the output above I stripped lines that were not operations directly
>> on the socket. The poll timeouts continue on, with nothing in between.
>>
>> [root@mmq1 tmp]# lsof -p 6755 | grep mmq1
>> java 6755 root 85u IPv6 1036912 TCP mmq1.eng.e-dialog.com:61613 (LISTEN)
>> java 6755 root 92u IPv6 1038039 TCP mmq1.eng.e-dialog.com:61613->10.0.13.230:46542 (ESTABLISHED)
>> java 6755 root 94u IPv6 1036997 TCP mmq1.eng.e-dialog.com:61613->mmd2.eng.e-dialog.com:41743 (ESTABLISHED)
>>
>> The connection to mmd2 is to the host that is gone. The one to
>> 10.0.13.230 is up and active.
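[Editorial note: the strace loop above can be reproduced in miniature. poll() on a socket with a timeout simply returns zero events when the peer sends nothing; the kernel raises no error for a peer that has silently disappeared, which is why the broker spins on "= 0 (Timeout)" forever. A short Python sketch, with the timeout shortened to milliseconds and a socketpair standing in for the broker/client connection:]

```python
import select
import socket

# A local socketpair stands in for the broker<->client TCP connection.
a, b = socket.socketpair()
p = select.poll()
p.register(a.fileno(), select.POLLIN | select.POLLERR)

# No data pending: poll returns an empty list on timeout, just like the
# "= 0 (Timeout)" lines in the trace. A vanished peer looks identical to
# an idle one.
assert p.poll(100) == []

# Once the peer actually writes, poll reports POLLIN, matching the
# "revents=POLLIN" entries above.
b.sendall(b"CONNECT\n")
events = p.poll(1000)
assert events and events[0][1] & select.POLLIN

a.close()
b.close()
```

The point is that poll's timeout only bounds how long one wait lasts; it says nothing about connection health, so the broker dutifully re-polls forever.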
>> When I kill -9 the process on 10.0.13.230 I see this in the logs:
>>
>> 2010-04-13 17:13:55,322 | DEBUG | Transport failed: java.io.EOFException | org.apache.activemq.broker.TransportConnection.Transport | ActiveMQ Transport: tcp:///10.0.13.230:45463
>> java.io.EOFException
>>     at java.io.DataInputStream.readByte(Unknown Source)
>>     at org.apache.activemq.transport.stomp.StompWireFormat.readLine(StompWireFormat.java:186)
>>     at org.apache.activemq.transport.stomp.StompWireFormat.unmarshal(StompWireFormat.java:94)
>>     at org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:211)
>>     at org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:203)
>>     at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:186)
>>     at java.lang.Thread.run(Unknown Source)
>> 2010-04-13 17:13:55,325 | DEBUG | Stopping connection: /10.0.13.230:45463 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
>> 2010-04-13 17:13:55,325 | DEBUG | Stopping transport tcp:///10.0.13.230:45463 | org.apache.activemq.transport.tcp.TcpTransport | ActiveMQ Task
>> 2010-04-13 17:13:55,326 | DEBUG | Stopped transport: /10.0.13.230:45463 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
>> 2010-04-13 17:13:55,327 | DEBUG | Cleaning up connection resources: /10.0.13.230:45463 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
>> 2010-04-13 17:13:55,327 | DEBUG | remove connection id: ID:mmq1-58415-1271193024658-2:3 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
>> 2010-04-13 17:13:55,328 | DEBUG | masterb1 removing consumer: ID:mmq1-58415-1271193024658-2:3:-1:1 for destination: queue://Producer/TESTING/weight/three/ | org.apache.activemq.broker.region.AbstractRegion | ActiveMQ Task
>>
>> Which is what I want to happen when the host goes down.
>>
>> It seems to me that something should be noticing that the connection is
>> really gone.
>> Maybe this is more of a kernel issue. I would think that when the poll
>> is done it would trigger the connection to move out of the ESTABLISHED
>> state and get closed.
>>
>> We are using Linux, kernel version 2.6.18, but I've seen this same issue
>> on a range of different 2.6 versions.
>>
>> -Josh
>>
>> On 04/14/2010 09:38 AM, Josh Carlson wrote:
>>
>>> Thanks Gary for the, as usual, helpful information.
>>>
>>> It looks like the broker may be suffering from exactly the same problem
>>> we encountered when implementing client-side failover. Namely, when the
>>> master broker went down, a subsequent read on the socket by the client
>>> would hang (well, actually take a very long time to fail/timeout). In
>>> that case our TCP connection was ESTABLISHED, and looking at the broker
>>> I see the same thing after the client host goes away (the connection is
>>> ESTABLISHED). We fixed this issue in our client by setting the socket
>>> option SO_RCVTIMEO on the connection to the broker.
>>>
>>> I noted that the broker appears to do the same thing with the TCP
>>> transport option soTimeout. It looks like when this is set it winds up
>>> as a call to java.net.Socket.setSoTimeout when the socket is getting
>>> initialized. I have not done any socket programming in Java, but my
>>> assumption is that SO_TIMEOUT maps to both SO_RCVTIMEO and SO_SNDTIMEO
>>> in the C world.
>>>
>>> I was hopeful with this option, but when I set it in my transport
>>> connector:
>>>
>>> <transportConnector uri="stomp://mmq1:61613?soTimeout=60000"/>
>>>
>>> the timeout does not occur. I actually ran my test case about 15 hours
>>> ago and I can still see that the broker still has an ESTABLISHED
>>> connection to the dead client and has a message dispatched to it.
>>>
>>> Am I misunderstanding what soTimeout is for? I can see in
>>> org.apache.activemq.transport.tcp.TcpTransport.initialiseSocket that
>>> setSoTimeout is getting called unconditionally.
>>> So what I'm wondering is whether it is actually calling it with a 0
>>> value despite the way I set up my transport connector. I suppose
>>> setting this to 0 would explain why it apparently never times out,
>>> whereas in our client case it eventually did time out (because we were
>>> not setting the option at all before).
>>>
>>> On 04/14/2010 05:15 AM, Gary Tully wrote:
>>>
>>>> The re-dispatch is triggered by the tcp connection dying; netstat can
>>>> help with the diagnosis here. Check the connection state of the broker
>>>> port after the client host is rebooted. If the connection is still
>>>> active, possibly in a TIME_WAIT state, you may need to configure some
>>>> additional timeout options on the broker side.
>>>>
>>>> On 13 April 2010 19:43, Josh Carlson wrote:
>>>>
>>>> I am using client acknowledgements with a prefetch size of 1, with no
>>>> message expiration policy. When a consumer subscribes to a queue I can
>>>> see that the message gets dispatched correctly. If the process gets
>>>> killed before retrieving and acknowledging the message, I see the
>>>> message getting re-dispatched (correctly). I expected the same
>>>> behaviour if the host running the process gets rebooted or crashes.
>>>> However, after a reboot I can see that the message is stuck in the
>>>> dispatched state to the consumer that is long gone. Is there a way
>>>> that I can get messages re-dispatched when a host hosting consumer
>>>> processes gets rebooted? How does it detect the case when a process
>>>> dies (even with SIGKILL)?
>>>>
>>>> I did notice that if I increase my prefetch size and enqueue another
>>>> message after the reboot, activemq will re-dispatch the original
>>>> message. However, with prefetch size equal to one the message never
>>>> seems to get re-dispatched.
>>>>
>>>> --
>>>> http://blog.garytully.com
>>>>
>>>> Open Source Integration
>>>> http://fusesource.com
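[Editorial note: the client-side fix Josh describes earlier in the thread, setting SO_RCVTIMEO, can be sketched in Python, where settimeout gives the same observable behaviour: a read that would otherwise block forever on a dead-but-ESTABLISHED connection fails once the timeout elapses. Incidentally, Java's Socket.setSoTimeout affects only read operations, i.e. it is the analogue of SO_RCVTIMEO, not SO_SNDTIMEO. The 200 ms value below is for illustration only, not from the thread.]

```python
import socket

# A local socketpair stands in for the client's connection to the broker;
# the "broker" end (b) deliberately never writes, mimicking a peer that
# has silently gone away.
a, b = socket.socketpair()
a.settimeout(0.2)  # receive timeout: roughly what SO_RCVTIMEO provides in C

try:
    a.recv(8192)       # no data will ever arrive
    timed_out = False
except socket.timeout:
    # Instead of hanging indefinitely, the read fails after 200 ms and the
    # application can tear the connection down and fail over.
    timed_out = True

assert timed_out
a.close()
b.close()
```

This is the behaviour Josh expected from the broker's soTimeout option; whether the configured value actually reaches setSoTimeout is exactly the open question in his message above.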