httpd-dev mailing list archives

From Marc Slemko
Subject Re: paper on "Promoting the Use of End-to-End Congestion Control" (fwd)
Date Wed, 04 Mar 1998 19:08:39 GMT
Dean, that explains it. Sigh.

---------- Forwarded message ----------
Date: Wed, 4 Mar 1998 11:25:00 -0600 (CST)
From: David Borman <dab@BSDI.COM>
To: end2end-interest@ISI.EDU
Subject: Re: paper on "Promoting the Use of End-to-End Congestion  Control"

All this discussion about the Nagle algorithm, buffer sizes,
read/write sizes, etc. has been interesting.  I'm not responding
to any particular message, but there are some details about the
typical BSD implementation that have not yet been discussed.

So, having been through this code in detail, here's how things
work in the typical 4.4 BSD based implementation.

The application does a write() system call.  Assume the app
does a large write, say at least 4K or 8K.  Down into the kernel
we go, into sosend().  This is where the data is copied from
user space down into the kernel, in chunks of MCLBYTES in
length.  However, for stream protocols, it copies down one
chunk of MCLBYTES, and then hands that down to tcp_usrreq()
which puts the data on the send queue and calls tcp_output().
When it returns, it copies down the next MCLBYTES, and hands
that to tcp_output(), and so on until all the data has been
copied down.  If there is more data than will fit into the
socket send buffer, it pauses between copies waiting for data
to drain out (i.e., when it is ACKed and dropped from the front
of the queue).
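
As a rough illustration of the copy loop just described, the sketch
below models how a single write() is handed down in MCLBYTES-sized
pieces.  It is a Python toy, not kernel code: the function name
sosend_chunks is made up for this example, MCLBYTES = 2048 is the
value used later in this message, and waiting for send-buffer space
is not modelled.

```python
MCLBYTES = 2048  # chunk size assumed here; the message uses 2048 in its example


def sosend_chunks(write_len, mclbytes=MCLBYTES):
    """Return the sizes of the pieces one write() is handed down in.

    Toy model of the copy loop: each piece would be queued and followed
    by one call to tcp_output() before the next piece is copied.
    """
    chunks = []
    remaining = write_len
    while remaining > 0:
        n = min(mclbytes, remaining)
        chunks.append(n)
        remaining -= n
    return chunks


print(sosend_chunks(8192))  # [2048, 2048, 2048, 2048]
print(sosend_chunks(5000))  # [2048, 2048, 904]
```

The point to take away is that tcp_output() runs once per piece, so
it sees an 8K write as four separate 2K arrivals.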

Now let's go down into tcp_output().  At the top, it decides
whether or not the connection is idle, based on whether or not
there is outstanding unACKed data.  If the connection is idle,
then all the data on the send queue will be sent in as many
packets as it takes, as long as it all fits in the congestion window.

Ok, have you picked up yet on the problem here?  The data is
being handed to tcp_output() in MCLBYTES chunks, and only when
processing the first chunk will tcp_output() consider the
connection idle.  Because it is idle, it will send out all the
data.  If MCLBYTES = 2048, and we are dealing with an ethernet,
that will get divided into two packets, one full size at 1440
bytes of data, and a smaller one with 608 bytes of data.  The
send buffer is a multiple of 1440 (the mss), so as additional
data gets written to the socket, eventually it is full with
some number of additional 1440 sized packets, plus a remainder
of 832 bytes (1440 - 608).  Now, have we seen these numbers before?

Rick Jones wrote:
> Interesting.  A request for a 64k (+ headers) document from an Apache
> server running FreeBSD 2.2.x ( resulted in
> segments sized: 
>    1 1406
>    1 446
>    1 798
>    2 224
>    2 866
>    4 832
>   12 608
>   35 1440

There they are, 1440, 608 and 832.  As ACKs come back and the
application continues to write, the interaction between write(),
sosend() and tcp_output() will continue to chop up the data at
times into odd sized pieces, as witnessed by Rick Jones.
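
The arithmetic behind those three numbers, and a tally of the quoted
trace, can be checked with a few lines of Python.  This is only a
sketch: split_into_segments is a name invented for the example, and
the 12-segment send buffer is a hypothetical size chosen to show that
any multiple of the mss leaves the same 832-byte remainder.

```python
MSS = 1440       # segment payload size reported in the message
MCLBYTES = 2048  # chunk size in the message's example


def split_into_segments(nbytes, mss=MSS):
    """Split nbytes into full-sized segments plus one short tail, if any."""
    full, tail = divmod(nbytes, mss)
    return [mss] * full + ([tail] if tail else [])


first_chunk = split_into_segments(MCLBYTES)
print(first_chunk)                 # [1440, 608]
print(MSS - first_chunk[-1])       # 832

# Hypothetical send buffer of 12 segments: what is left after the first
# 2048 bytes holds whole segments plus the same 832-byte remainder.
print((12 * MSS - MCLBYTES) % MSS)  # 832

# The quoted trace as {segment size: count}.
trace = {1406: 1, 446: 1, 798: 1, 224: 2, 866: 2, 832: 4, 608: 12, 1440: 35}
total_segments = sum(trace.values())
total_bytes = sum(size * count for size, count in trace.items())
print(total_segments, total_bytes)  # 58 65854
```

If 64k is read as 65536 bytes, the total leaves 318 bytes for the
headers, and only 35 of the 58 segments are full sized.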

I know about all this because I spent a fair amount of time
addressing this in BSD/OS.  (See http://www.BSDI.COM/press/19960827
for the press release about the patch that included these changes)

To properly address this situation, two things need to happen.
First, tcp_output() needs to maintain memory from call to call
about whether or not it was "idle" on the previous call and
whether or not it should still consider the connection idle,
and secondly, tcp_output() needs to be able to call back up to
find out if sosend() has more data waiting to be copied down.

So, in a properly working world, if the app does a 4K write,
the first 2K comes down to tcp_output(), and it sends out 1440
bytes.  But it notices that sosend() has more data to give it,
so it doesn't send out the remaining 608 bytes, even though the
connection is idle.  The next 2K comes down, and it sends out
another 1440 byte packet.  At this point, the congestion window
is full, so we don't send anything else out.  The app does another
4K write, and the data all gets copied down into the socket buffer.
As the ACKs come back, we send more data in nice 1440 byte sized
chunks, until the end of the data stream when we send out the
final trailing piece.

BTW, the problems with the current BSD code are why if you have
an FDDI interface and MCLBYTES is 2K, the initial data packet
sent out over the 4K-MTU connection is only 2K, even if the
application wrote 4K or more of data!
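
The difference between the two behaviours can be compared in a toy
model.  The Python sketch below is an assumption-laden simplification,
not the BSD or BSD/OS code: first_write and its parameters are invented
for the example, no ACKs arrive during the write, the congestion window
is given in bytes, and a round 4096-byte segment size stands in for the
FDDI case.

```python
MSS = 1440        # ethernet segment payload size used in the message
MCLBYTES = 2048   # chunk size handed down per call in the message's example


def first_write(write_len, cwnd, patched, mss=MSS):
    """Segment sizes sent while one write() on an idle connection is
    copied down chunk by chunk (toy model, no ACKs arrive meanwhile)."""
    sent = []        # payload size of each segment put on the wire
    queued = 0       # bytes on the send queue, not yet sent
    in_flight = 0    # bytes sent but not yet ACKed
    remaining = write_len
    while remaining > 0:
        chunk = min(MCLBYTES, remaining)
        remaining -= chunk
        queued += chunk
        more_coming = remaining > 0
        # Unpatched: idleness is recomputed from unACKed data on every call.
        # Patched: the connection was idle when the write began, and that
        # is remembered from call to call.
        idle = True if patched else in_flight == 0
        # Full-sized segments go out whenever the window has room.
        while queued >= mss and in_flight + mss <= cwnd:
            sent.append(mss)
            queued -= mss
            in_flight += mss
        # A short tail goes out only on an idle connection; the patched
        # model also holds it back while more data is waiting above.
        hold = patched and more_coming
        if 0 < queued < mss and idle and not hold and in_flight + queued <= cwnd:
            sent.append(queued)
            in_flight += queued
            queued = 0
    return sent


print(first_write(4096, 2 * MSS, patched=False))        # [1440, 608]
print(first_write(4096, 2 * MSS, patched=True))         # [1440, 1440]
print(first_write(4096, 4096, patched=False, mss=4096))  # [2048]
print(first_write(4096, 4096, patched=True, mss=4096))   # [4096]
```

With a two-segment window the unpatched model emits the 1440 + 608
split and then stalls, while the patched model fills the window with
two full segments.  The last two lines reproduce the FDDI observation:
a 2K first packet without the changes, a full 4K one with them.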

After our performance changes, a typical TCP data transfer
where the application is doing larger writes (4K to 8K or
more) over an ethernet, and the app can write the data faster
than the network can send it, results in a stream of 1440 byte
packets, followed by one trailing packet for the final data.
(Note, we consider the ACK of the initial SYN to open up the
congestion window, so when the first data is written we'll
send out 2 packets, not just one.  That gets around the initial
delayed ACK problem.)  My goal was to eliminate all artificial
delays (due to delayed acks, small packets and Nagle), and I
believe that I accomplished that in BSD/OS, without compromising
TCP or any of the congestion control code.

		-David Borman
