river-dev mailing list archives

From "Christopher Dolan" <christopher.do...@avid.com>
Subject RE: client hang in com.sun.jini.jeri.internal.mux.Mux.start()
Date Wed, 04 May 2011 15:47:24 GMT
Here's another, more obscure scenario caused by dynamic port allocation.
The server process ends without informing Reggie (OS or JVM crash,
perhaps). On server restart, a non-JERI TCP service gets assigned the
port number the dead process had been using. Then the client tries to
connect to that port and sends its 8-byte JERI handshake. The client
waits for the server to respond, but the server also waits for more
bytes from the client because it's using a different protocol -->
deadlock. A timeout would break that deadlock.
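
As a minimal standalone sketch of that failure mode (not the actual Mux
code; the port and the 8-byte payload are just illustrative stand-ins),
a read timeout is what turns the indefinite hang into a prompt error:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    public class HandshakeHangDemo {
        public static void main(String[] args) throws IOException {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress("localhost", 4160), 5000);
                s.setSoTimeout(15000); // bound the wait for the reply
                s.getOutputStream().write(new byte[8]); // stand-in greeting
                try {
                    int n = s.getInputStream().read(new byte[8]);
                    System.out.println("got " + n + " reply bytes");
                } catch (SocketTimeoutException e) {
                    // Without setSoTimeout, this read blocks forever when the
                    // peer speaks a different protocol and is itself waiting
                    // for more bytes from us.
                    System.out.println("no handshake reply within 15s");
                }
            }
        }
    }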

I haven't conclusively witnessed that specific deadlock, but I've had a
closely related problem where another process coincidentally grabs port
4160 before Reggie gets it. This happens because Win2k, WinXP and Win2k3
use 1024-5000 for their dynamic port range, contrary to IANA
recommendations. I suspect the deadlock described above happens in real
life, but I've never gotten detailed enough logs to prove it, just
client stack traces showing the hang in Mux.start().

Chris

-----Original Message-----
From: Gregg Wonderly [mailto:greggwon@gmail.com] 
Sent: Tuesday, May 03, 2011 11:49 PM
To: dev@river.apache.org
Subject: Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Right.  So an obvious issue is a network segmentation event where some
form of working routing is no longer present.  We'd then have normal TCP
timers to deal with, keep-alive notwithstanding.  These are well-known
rather than unusual problems, but they are often unexpected.  When
networks were slow and intermittently connected, TCP timeout values on
the order of minutes were sensible (time to redial the modem).  Much
shorter time-outs can be sensible on modern networks, as long as you
don't exceed the system's limits for transitioning through the needed
states.
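
(A small concrete version of that, not River code, and host/port are
placeholders: bound connection setup and reads explicitly instead of
waiting out the OS-level TCP retransmission timers, which can run to
minutes.)

    Socket s = new Socket();
    s.connect(new InetSocketAddress(host, port), 3000); // 3 s connect cap
    s.setSoTimeout(10000);  // 10 s cap on any single blocking read
    s.setKeepAlive(true);   // TCP keep-alive, for what it's worth here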

Gregg

Sent from my iPad

On May 3, 2011, at 2:28 AM, Dan Creswell <dan.creswell@gmail.com> wrote:

> Mmm, agreed, though I think doing it requires that Chris get to the
> root cause that is producing the need for a client timeout.
> 
> On 2 May 2011 23:17, Gregg Wonderly <gergg@cox.net> wrote:
>> Yes I think a timeout is what will work the best.  I just want to
>> make sure we can release any associated server resources at the same
>> time so that this doesn't result in a resource leak in the server.
>> 
>> Gregg
>> 
>> Sent from my iPhone
>> 
>> On May 2, 2011, at 11:27 AM, "Christopher Dolan" <christopher.dolan@avid.com> wrote:
>> 
>>> I strongly agree with Gregg's comments about timeouts in general, but
>>> I think this might be a special case. In the Mux.start() data flow,
>>> the client sends an 8-byte handshake and expects the server to respond
>>> with a similar 8-byte handshake (or error) promptly. I'm seeing
>>> indefinite stalls in real-world cases, so I need to do something, and
>>> a timeout is the only solution I can think of.
>>> 
>>> Chris
>>> 
>>> -----Original Message-----
>>> From: Gregg Wonderly [mailto:gergg@cox.net]
>>> Sent: Saturday, April 30, 2011 10:03 AM
>>> To: dev@river.apache.org
>>> Cc: Mike McGrady
>>> Subject: Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()
>>> 
>>> In the history of the original Java threading model, the NIO design
>>> choice of having you "select/poll" from your own thread, rather than
>>> registering callback methods (via an interface), kept a lot of
>>> development from using a model where threading was managed internally
>>> by the package or by the JVM.  As a result, we have structures like
>>> today's, where notifications are less common.  In this code, though, I
>>> think the structure is internal enough that it isn't really necessary
>>> to use Future or some other mechanism.
>>> 
>>> Timeouts are always a "hard way" to manage "loss of functionality"
>>> because you really don't know when things are "not working", only that
>>> something is taking longer than your timeout accounted for.  Timeouts
>>> can make it possible for more pending work to pile up on the other
>>> end, which might slow the results even more.  E.g. you wait 30 seconds
>>> and retry while the actual work on the other end is taking 35 seconds
>>> to get to, and thus the queue rate exceeds the dequeue rate and things
>>> start piling up.
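>>> 
>>> (A toy illustration of that arithmetic, using the numbers above and
>>> nothing measured: one client retrying every 30 s against work that
>>> takes 35 s adds roughly one queued request every 3.5 minutes, and the
>>> backlog never drains.)
>>> 
>>>     double arrivalsPerSec   = 1.0 / 30;  // a retry every 30 s
>>>     double departuresPerSec = 1.0 / 35;  // a completion every 35 s
>>>     double backlog = 0;
>>>     for (int minute = 1; minute <= 60; minute++) {
>>>         backlog += (arrivalsPerSec - departuresPerSec) * 60;
>>>         System.out.printf("after %d min: %.1f queued%n", minute, backlog);
>>>     }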
>>> 
>>> If you are going to use a timeout, we need to have some sort of
>>> indication, from both ends' perspective, that the attempt has been
>>> aborted, as early as possible.  For cases where I/O traffic is
>>> written/read, that usually means closing the socket.  I'm not sure of
>>> the ramifications of doing that, since I haven't looked too hard at
>>> this code.
>>> 
>>> Gregg Wonderly
>>> 
>>> On 4/29/2011 4:38 PM, Mike McGrady wrote:
>>>> Throwing my two cents in here just to state my opinion.  This is an
>>>> effort that could pay dividends if it were done with a view toward
>>>> the future - absolutely no pun intended.  I do not know the details
>>>> of the code but if futures could be useful here, they would be
>>>> welcomed by myself.
>>>> 
>>>> MG
>>>> 
>>>> 
>>>> On Apr 29, 2011, at 1:33 PM, Christopher Dolan wrote:
>>>> 
>>>>> Thanks, Tom.
>>>>> 
>>>>> I don't really understand your Future suggestion.  Are you suggesting
>>>>> changing the async handshake to a Future? If so, that sounds like a
>>>>> very involved change, touching a lot of code in Mux and its
>>>>> subclasses.
>>>>> 
>>>>> setDown() changes the value of the muxDown boolean to true, so it's a
>>>>> valid way out of the loop.
>>>>> 
>>>>> Chris
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Tom Hobbs [mailto:tvhobbs@googlemail.com]
>>>>> Sent: Friday, April 29, 2011 2:27 PM
>>>>> To: dev@river.apache.org
>>>>> Subject: Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()
>>>>> 
>>>>> The proposed code looks fine to me.  Two points leap out, more
>>>>> discussion points than anything else.
>>>>> 
>>>>> For some reason I've recently developed an aversion to writing my own
>>>>> timeout logic.  Did you consider using something like a Future here,
>>>>> or might that be serious overkill (it wouldn't surprise me if it
>>>>> was)?
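>>>>> 
>>>>> (For what a Future version might look like - purely hypothetical, not
>>>>> Mux's actual API, and readServerHeader() is a made-up stand-in for
>>>>> the blocking read - the timeout then lives in Future.get() rather
>>>>> than a hand-rolled wait loop:)
>>>>> 
>>>>>     // uses java.util.concurrent.*
>>>>>     ExecutorService ex = Executors.newSingleThreadExecutor();
>>>>>     Future<byte[]> reply = ex.submit(new Callable<byte[]>() {
>>>>>         public byte[] call() throws IOException {
>>>>>             return readServerHeader(); // hypothetical blocking read
>>>>>         }
>>>>>     });
>>>>>     try {
>>>>>         // InterruptedException/ExecutionException left to the caller
>>>>>         byte[] header = reply.get(15, TimeUnit.SECONDS);
>>>>>     } catch (TimeoutException e) {
>>>>>         reply.cancel(true); // interrupts the blocked reader thread
>>>>>     } finally {
>>>>>         ex.shutdownNow();
>>>>>     }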
>>>>> 
>>>>> Also, is setDown() intended to break out of the while loop?  Because
>>>>> I can't see a way to escape it.  (I don't have the rest of the code
>>>>> in front of me.)
>>>>> 
>>>>> Thanks for continuing to raise these issues - especially because you
>>>>> usually supply fixes for them!
>>>>> 
>>>>> Tom
>>>>> 
>>>>> On 29 Apr 2011 19:41, "Christopher Dolan" <christopher.dolan@avid.com>
>>>>> wrote:
>>>>>> I've experienced occasional cases where clients get stuck in the
>>>>>> following block of code in Mux.start(). Has anyone experienced this
>>>>>> problem? I have a proposed solution below. Has anyone thought about
>>>>>> a similar solution already?
>>>>>> 
>>>>>> -- Current code --
>>>>>>  1  asyncSendClientConnectionHeader();
>>>>>>  2  synchronized (muxLock) {
>>>>>>  3      while (!muxDown && !clientConnectionReady) {
>>>>>>  4          try {
>>>>>>  5              muxLock.wait();    // REMIND: timeout?
>>>>>>  6          } catch (InterruptedException e) {
>>>>>>  7              ...
>>>>>>  8          }
>>>>>>  9      }
>>>>>> 10      if (muxDown) {
>>>>>> 11          IOException ioe = new IOException(muxDownMessage);
>>>>>> 12          ioe.initCause(muxDownCause);
>>>>>> 13          throw ioe;
>>>>>> 14      }
>>>>>> 15  }
>>>>>> 
>>>>>> -- Explanation of the code --
>>>>>> This code handles the initial client-server handshake that starts a
>>>>>> JERI connection. In line 1, the client sends its 8-byte greeting to
>>>>>> the server. Then, in the loop on lines 3-9, it waits for the
>>>>>> server's response. If the reader thread gets a satisfactory response
>>>>>> from the server, it sets clientConnectionReady=true and calls
>>>>>> muxLock.notifyAll(). In all other cases (aborted connection,
>>>>>> mismatched protocol version, etc.) the reader invokes Mux.setDown(),
>>>>>> which sets muxDown=true and calls muxLock.notifyAll(). In lines
>>>>>> 10-14, it throws if the handshake was a failure.
>>>>>> 
>>>>>> In my scenario (which uses simple TCP sockets, nothing fancy), the
>>>>>> invoker thread sits on line 5 indefinitely. My problem is hard to
>>>>>> reproduce, so I haven't found out what the server is doing in this
>>>>>> case. I hope to figure that out eventually, but presently I'm
>>>>>> interested in the "REMIND: timeout?" comment.
>>>>>> 
>>>>>> -- Timeout solution --
>>>>>> It seems obvious to me that there should be a timeout here. There
>>>>>> are lots of imaginable cases where the client could get stuck here:
>>>>>> server-side deadlock, abrupt server crash, logic error in client Mux
>>>>>> code. You'd expect that the server would either respond with its
>>>>>> 8-byte handshake very quickly or never, so a modest timeout (like 15
>>>>>> or 30 seconds) should be good. If that timeout is triggered, I would
>>>>>> expect that the code above would call Mux.setDown() and throw an
>>>>>> IOException. That exception would either cause a retry or be thrown
>>>>>> up to the invoker as a RemoteException.
>>>>>> 
>>>>>> -- Proposed code (untested) --
>>>>>>     long now = System.currentTimeMillis();
>>>>>>     long endTime = now + timeoutMillis;
>>>>>>     while (!muxDown && !clientConnectionReady) {
>>>>>>         if (now >= endTime) {
>>>>>>             setDown("timeout waiting for server to respond to handshake",
>>>>>>                     null);
>>>>>>         } else {
>>>>>>             try {
>>>>>>                 muxLock.wait(endTime - now);
>>>>>>                 now = System.currentTimeMillis();
>>>>>>             } catch (InterruptedException e) {
>>>>>>                 setDown("interrupt waiting for connection header", e);
>>>>>>             }
>>>>>>         }
>>>>>>     }
>>>>>> 
>>>>>> This code assumes a configurable timeoutMillis parameter has been
>>>>>> set earlier.
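>>>>>> 
>>>>>> (One hypothetical way to wire that up - the property name below is
>>>>>> made up, not a real River configuration entry - with a 15-second
>>>>>> default:)
>>>>>> 
>>>>>>     private static final long timeoutMillis =
>>>>>>         Long.getLong("com.sun.jini.jeri.mux.handshakeTimeout", 15000L);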
>>>>>> 
>>>>>> I can't think of any alternative solutions. Putting the timeout in
>>>>>> the Reader logic seems higher risk. There's incomplete code in JERI
>>>>>> to implement a ping packet (see Mux.asyncSendPing, never used), but
>>>>>> that would only be relevant after the initial handshake and wouldn't
>>>>>> help here.
>>>>>> 
>>>>>> Thanks,
>>>>>> Chris
>>>>>> 
>>>> 
>>>> Michael McGrady
>>>> Chief Systems Architect
>>>> Topia Technology, Inc.
>>>> Cel 1.253.720.3365
>>>> Work 1.253.572.9712 extension 2037
>>>> Skype ID: michael.mcgrady5
>>>> mmcgrady@topiatechnology.com
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 
