activemq-dev mailing list archives

From "Gary Tully (JIRA)" <>
Subject [jira] Commented: (AMQ-1993) Systems hang due to inability to timeout socket write operation
Date Mon, 03 Nov 2008 15:15:05 GMT


Gary Tully commented on AMQ-1993:

I am not sure it is safer, because the filter changes behaviour in the normal
exception case, i.e. onException is now always called.
In addition, if a close is done asynchronously from an onException, there is still an
opportunity for a normal IOException to be interleaved with a forced exception.
I think this is the same as with a pass-through onException: close can get called twice,
but close handles this OK.
Mostly, though, I am wary of the change in behaviour introduced by the exception handler.
As this is a filter that is added by choice it is not such a big deal, but we may as well iron
out the details. This is a handy feature.
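
To illustrate why a double close is harmless, here is a minimal sketch of an idempotent close. The class and method names are hypothetical and are not taken from the ActiveMQ code base or from the attached patches; the counter only stands in for the real resource release.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a transport whose close() may be reached from
// both the normal exception path and a forced (timeout) exception path.
class IdempotentCloser {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    // Counts how often the underlying resource was really released (illustration only).
    private final AtomicInteger releases = new AtomicInteger();

    // Safe to call any number of times, from any thread.
    void close() {
        // Only the first caller wins the compare-and-set and performs the release.
        if (closed.compareAndSet(false, true)) {
            releases.incrementAndGet(); // stand-in for releasing the socket
        }
    }

    // Both a normal IOException and a forced one end up here.
    void onException(IOException e) {
        close();
    }

    int releaseCount() {
        return releases.get();
    }
}
```

Even if a forced exception and a normal IOException interleave, the release runs once. This does not address the separate concern about onException now always being called.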

> Systems hang due to inability to timeout socket write operation
> ---------------------------------------------------------------
>                 Key: AMQ-1993
>                 URL:
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 5.1.0, 5.2.0
>         Environment: Unix (Solaris and Linux tested)
>            Reporter: Filip Hanik
>            Assignee: Gary Tully
>            Priority: Critical
>         Attachments: patch-1-threadname-filter.patch, patch-3-tcp-writetimeout.patch
> The blocking Java Socket API doesn't have a timeout on socketWrite invocations.
> This means that if a TCP session is dropped or terminated without RST or FIN packets, the operating system is left to eventually time out the session. On the Linux kernel this timeout usually takes 15 to 30 minutes.
> For this entire period, the AMQ server hangs, and producers and consumers are unable to use a topic.
> I have created two patches for this at the page:
> Let me show a bit more
> ---------------------------------
> "ActiveMQ Transport: tcp:///X.YYY.XXX.ZZZZ:2011" daemon prio=10 tid=0x0000000055d39000 nid=0xc78 runnable [0x00000000447c9000..0x00000000447cac10]
>    java.lang.Thread.State: RUNNABLE
> 	at Method)
> 	at
> This is a thread stuck in blocking IO, and it can be stuck for 30 minutes during the kernel's TCP retransmission attempts.
> Unfortunately the thread dump is very misleading, since the name of the thread is not the destination or even remotely related to the socket it is operating on.
> To mend this, the patch proposes a very simple (and configurable) ThreadNameFilter that appends the destination and helps the system administrator correctly identify the client that is about to receive data.
> -----------------------------------
> 	at
> 	at
> 	- locked <0x00002aaaec155818> (a
> 	at
> The lock held in this trace unfortunately makes the entire Topic single threaded.
> While this lock is held, no other clients (producers and consumers) can publish to or receive from this topic.
> And this lock can be held for up to 30 minutes.
> I consider solving this single-threaded behavior a 'feature enhancement' that should be handled separately from this bug, because even if it is solved, threads still risk being stuck in socketWrite0 for dropped connections that still appear to be established.
> For this, I have implemented a socket timeout filter based on a TransportFilter; this filter only times out connections that are actually writing data.
> The two patches are at:
> the binary 0000.jar applies to both 5.1 and trunk and can be used today in existing environments.
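
For readers without access to the patches, the idea of timing out only connections that are actually writing can be sketched roughly as follows. This is not the code from the attached patches: TimedWriteStream, isWriteOverdue and closeIfOverdue are hypothetical names, the sketch wraps a plain OutputStream rather than an ActiveMQ TransportFilter, and it assumes a single writer thread per stream. It relies on the documented behaviour of java.net.Socket.close(), which makes a thread blocked in I/O on that socket throw a SocketException.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch: records when a write starts so that a watchdog can
// tell an idle connection apart from one stuck inside a blocking write.
class TimedWriteStream extends OutputStream {
    private final OutputStream delegate;
    // null while idle; otherwise the System.nanoTime() at which the current write began.
    private volatile Long writeStartNanos = null;

    TimedWriteStream(OutputStream delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        writeStartNanos = System.nanoTime();
        try {
            delegate.write(buf, off, len); // may block for a long time on a dead peer
        } finally {
            writeStartNanos = null; // idle again, so the watchdog ignores this stream
        }
    }

    @Override
    public void flush() throws IOException {
        delegate.flush();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }

    // True only if a write is in progress and has run longer than timeoutNanos.
    boolean isWriteOverdue(long timeoutNanos, long nowNanos) {
        Long start = writeStartNanos;
        return start != null && nowNanos - start > timeoutNanos;
    }

    // Intended to be called periodically from a separate watchdog thread.
    // Closing the socket is what unblocks the stuck writer.
    static boolean closeIfOverdue(TimedWriteStream stream, Closeable socket,
                                  long timeoutNanos) throws IOException {
        if (stream.isWriteOverdue(timeoutNanos, System.nanoTime())) {
            socket.close();
            return true;
        }
        return false;
    }
}
```

A watchdog could call closeIfOverdue for each open connection every few seconds, for example from a ScheduledExecutorService. Idle connections are never closed by this check, because no write start time is recorded for them.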

