httpd-dev mailing list archives

From: Marc Slemko
Subject: Re: Socket and Protocol error in error_log file (fwd)
Date: Sun, 19 Jan 1997 19:51:28 GMT

---------- Forwarded message ----------
Date: Sun, 19 Jan 1997 09:11:27 -0800
From: Mukesh Kacker <mukesh.kacker@Eng.Sun.COM>
Cc: mukesh.kacker@Eng.Sun.COM
Subject: Re: VERY URGENT!!!!! Socket and Protocol error in error_log file

> I have forwarded this to the development mailing list for comments.

> The change would have to be wrapped in a mess of ifdefs, since not all
> platforms define EPROTO

That is certainly true. BSD did not have EPROTO; it is a System V thing.
SunOS 4.x has EPROTO, possibly because it had already started adopting
System V features even before SunOS 5.x (Solaris 2.x). [ Not that the SunOS 4.x
accept() will ever return EPROTO, but the check would be harmless dead code there. ]

> (and possibly ECONNABORTED).

BSD has always had ECONNABORTED, so this one is unlikely to be a problem.
(BSD returns ECONNABORTED from some other calls, but not from accept().)
The BSD accept() does not return ECONNABORTED and silently hangs
on an aborted connection, which is regarded as a problem. The X/Open and
POSIX specifications of accept() (future semantics?) require ECONNABORTED from
accept(), so in future more platforms will be returning an error in the
"normal" path that Apache will log. [ It will just change to ECONNABORTED
instead of the admittedly broken EPROTO, which conveys confusing semantics. ]
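
To make the ifdef point concrete, here is a minimal sketch of how an
accept() error check could be guarded so that it compiles whether or not a
platform defines EPROTO or ECONNABORTED. The function name
accept_error_is_transient is hypothetical and is not taken from the Apache
source; treating both codes as "skip it and keep accepting" is an
assumption drawn from the discussion above, not a tested patch.

```c
#include <errno.h>

/* Hypothetical helper, not from the Apache source.
 * Returns 1 if an errno value from accept() only means that a connection
 * was aborted before it could be accepted, so the caller can go back to
 * accept() instead of treating it as a fatal error. Returns 0 otherwise. */
int accept_error_is_transient(int err)
{
#ifdef ECONNABORTED
    /* The value X/Open and POSIX specify for an aborted connection. */
    if (err == ECONNABORTED)
        return 1;
#endif
#ifdef EPROTO
    /* System V derived stacks report the same condition as EPROTO.
     * Where accept() never returns it, this is harmless dead code. */
    if (err == EPROTO)
        return 1;
#endif
    return 0;
}
```

A caller would test errno with this after accept() returns -1 and simply
retry when it returns 1. Separate if tests are used rather than a switch
so that nothing breaks should a platform define the two names with the
same value.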

> Right now we are having trouble with a mess of connections getting stuck
> in FIN_WAIT_2 and crashing boxes without a timeout for that state, so we
> may not get a chance to investigate this before 1.2... 

I have followed some of that discussion. My $0.02 worth is that it may
not be something for Apache to fix. Looking at the BSD patch, it seems BSD
has had the timeout in one code path to FIN_WAIT_2 and not in another. At the
TCP implementors' BOF at the last IETF, I mentioned this as one of those
"unwritten" (as in, not in RFC 793 or RFC 1122) caveats for TCP implementors
that need to be documented. This needs to be fixed in TCP.

Solaris 2.2 was bitten by it, and this was fixed in a patch a long time ago
for us. Some of the postings have quoted "2 MSL" (twice the maximum segment
lifetime) as the timer, which is incorrect (confusion created by the names of
variables in the BSD code). It is a completely separate timeout. In Solaris 2.x it is
controlled by tcp_fin_wait_2_flush_interval, tunable through ndd.

Something interesting to find out would be a study that identifies which
TCP implementation is at the other end when these things happen. The
speculation for us was that it was PC users powering off their machines instead
of doing a graceful exit from applications. My suspicion is that it is some
broken PC TCP stack that, when exiting from applications, does not terminate
the connections properly (does not issue a close(), which sends a FIN).
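
For contrast, here is a small sketch of the graceful case: when the peer
does issue close(), its TCP sends a FIN and the other end sees end-of-file
from read(). The helper name read_after_peer_close is hypothetical, and
the demo uses a loopback connection within one process purely for
illustration. It does not reproduce the FIN_WAIT_2 hang itself, which
needs a peer that never sends its FIN.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo helper: connects to itself over loopback, closes the
 * client end, and returns what read() reports on the accepted end.
 * 0 means EOF (the peer's FIN was seen); -1 means a call failed. */
long read_after_peer_close(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    char buf[1];
    long n = -1;
    int lfd = -1, cfd = -1, afd = -1;

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0) goto done;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                 /* let the kernel pick a free port */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto done;
    if (listen(lfd, 1) < 0) goto done;
    /* Learn which port was assigned so the client can connect to it. */
    if (getsockname(lfd, (struct sockaddr *)&addr, &len) < 0) goto done;

    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0) goto done;
    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto done;
    afd = accept(lfd, NULL, NULL);
    if (afd < 0) goto done;

    close(cfd);                        /* graceful close: a FIN is sent */
    cfd = -1;
    /* Blocks until the FIN arrives, then reports end-of-file. */
    n = (long)read(afd, buf, sizeof(buf));

done:
    if (cfd >= 0) close(cfd);
    if (afd >= 0) close(afd);
    if (lfd >= 0) close(lfd);
    return n;
}
```

A stack that exits without this close() (or a machine that is simply
powered off) never sends the FIN, so the side that closed first is left
waiting in FIN_WAIT_2 until some timer, if the implementation has one,
cleans it up.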

-Mukesh Kacker
 Internet Engineering
