httpd-dev mailing list archives

From Dean Gaudet <>
Subject zero-copy and mux
Date Sun, 20 Jun 1999 18:01:46 GMT

On Sun, 20 Jun 1999, Ben Laurie wrote:

> What about HTTPng? Muxing? Is using multiple buffers (temporarily,
> presumably) really that much of a problem?
> It seems a shame to lose this, because the ability to do it proves the
> abstraction is good.

Let's run through it a bit... our mux protocol will look something like

    struct packet {
        int connection_id;
        unsigned num_bytes;
        char data[];    /* num_bytes bytes of payload follow */
    };
oversimplified of course.

Suppose we have, say, 4 requests in progress, and 4 threads generating
responses.  Those will all be writing to individual BUFFs.  Eventually one
or more of them will have to flush their BUFF.

It'll call down into iol_mux, which will have a mutex to prevent all
4 threads from entering at once.  Can iol_mux decide to buffer the response at
this point?  I think not -- the upper layer really wanted a flush at this
point, or it would not have flushed (assume we got that right, because
we'll get it right for the non-mux case, and the code is the same).
So iol_mux has to send the packet at this point.
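
Roughly, in (hypothetical) code -- two plain write()s under the lock just
to keep the sketch obvious:

```c
#include <pthread.h>
#include <unistd.h>

/* Hypothetical framing for one mux packet header. */
struct iol_mux_hdr {
    int connection_id;
    unsigned num_bytes;
};

static pthread_mutex_t iol_mux_lock = PTHREAD_MUTEX_INITIALIZER;

/* The upper BUFF asked for a flush, so the packet must go out now.
 * The mutex keeps two threads from interleaving their header/data
 * pairs on the shared connection. */
int iol_mux_flush(int fd, int conn_id, const char *data, unsigned len)
{
    struct iol_mux_hdr hdr;
    int ok;

    hdr.connection_id = conn_id;
    hdr.num_bytes = len;

    pthread_mutex_lock(&iol_mux_lock);
    ok = write(fd, &hdr, sizeof hdr) == (ssize_t)sizeof hdr
      && write(fd, data, len) == (ssize_t)len;
    pthread_mutex_unlock(&iol_mux_lock);

    return ok ? 0 : -1;
}
```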

Maybe I'm wrong, maybe iol_mux has the option of buffering the packet.
The heuristic we might use is "this is a small packet, and we know there
are other requests in progress, and/or there is data to be read on this
or other connections" (i.e. an improved "saferead"/halfduplex heuristic).
The corresponding code in apache 1.x at this point would copy the packet
into a buffer... it is small after all.  Similarly if the packet was
large, and there was already buffered stuff in the iol_mux then we could
use a writev() combining the existing mux buffer and the new packet,
much like we choose to do large_write() in apache 1.x.  We don't have
zero-copy, but we really only have partial-copy, just like apache
1.x... and I'm pretty sure it's good enough.
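
The buffer-or-combine decision might look like this -- a sketch only; the
mux struct, the size cutoff, and the more_expected hint are all made up:

```c
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define MUX_BUFSZ     4096
#define SMALL_PACKET  512

struct mux {
    int fd;
    char buf[MUX_BUFSZ];
    size_t buflen;   /* bytes of previously buffered small packets */
};

/* Returns bytes the caller may consider written, or -1 on error. */
ssize_t mux_send(struct mux *m, const char *pkt, size_t len, int more_expected)
{
    /* Small packet and more traffic likely: just buffer it -- the one
     * copy apache 1.x would pay here anyway. */
    if (len <= SMALL_PACKET && more_expected && m->buflen + len <= MUX_BUFSZ) {
        memcpy(m->buf + m->buflen, pkt, len);
        m->buflen += len;
        return (ssize_t)len;
    }

    /* Large packet: combine any buffered bytes with it in one writev(),
     * like large_write() in apache 1.x -- no copy of the big payload. */
    struct iovec iov[2];
    int n = 0;
    if (m->buflen) {
        iov[n].iov_base = m->buf;
        iov[n].iov_len = m->buflen;
        n++;
    }
    iov[n].iov_base = (void *)pkt;
    iov[n].iov_len = len;
    n++;

    ssize_t w = writev(m->fd, iov, n);
    if (w >= 0)
        m->buflen = 0;   /* sketch assumes the writev wrote everything */
    return w;
}
```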

That last case is the same even if there's another thread trying to send
data over the mux -- the mux may have an existing buffer (of previous
small responses) and choose to combine it with the writev().

Notice the mux layer can put packet headers on with a writev() as well,
just like we did with chunking large packets.
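
That's just a two-element iovec (hypothetical names again):

```c
#include <sys/uio.h>

/* Hypothetical wire header for one mux packet. */
struct mux_hdr {
    int connection_id;
    unsigned num_bytes;
};

/* Prepend the packet header without copying the payload: the header
 * lives in a small stack struct and writev() gathers both pieces into
 * one packet -- the same trick apache 1.x uses for chunking. */
ssize_t mux_write_packet(int fd, int conn_id, const void *data, unsigned len)
{
    struct mux_hdr hdr;
    struct iovec iov[2];

    hdr.connection_id = conn_id;
    hdr.num_bytes = len;
    iov[0].iov_base = &hdr;
    iov[0].iov_len = sizeof hdr;
    iov[1].iov_base = (void *)data;
    iov[1].iov_len = len;

    return writev(fd, iov, 2);
}
```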

And if encryption is sitting after the mux it is going to take all these
writev() fragments and combine them into one (or more) larger buffers
and write() those... a copy we can't avoid anyhow.
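
i.e. a gather loop like this (cipher elided), where the copy doubles as
the pass the cipher has to make over the bytes anyway:

```c
#include <string.h>
#include <sys/uio.h>

/* A layer below the mux that must touch every byte anyway (e.g. a
 * cipher) folds the gather step into its own pass: copy the writev()
 * fragments into one contiguous buffer, then encrypt-and-write() that.
 * Returns the number of bytes gathered. */
size_t gather_fragments(const struct iovec *iov, int n, char *out, size_t outlen)
{
    size_t off = 0;
    int i;

    for (i = 0; i < n; i++) {
        if (off + iov[i].iov_len > outlen)
            break;   /* sketch only; real code would flush and continue */
        memcpy(out + off, iov[i].iov_base, iov[i].iov_len);
        off += iov[i].iov_len;
    }
    return off;
}
```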

My argument is essentially this: partial-copy, like we have already,
is about as expensive as the overhead of zero-copy.

The mux layer sees small packets only when total responses are small --
the BUFF above the mux ensures that.  The "tail" of a response is always a
"small" packet, but there we have a similar saferead/halfduplex heuristic
by which we may or may not buffer it.

Dunno really.  I don't have any numbers to back up my claim.  All I have
is one implementation to compare with, a TCP-TCP proxy (think socksgw
on steroids) which was initially one-copy, and which I rewrote with a
zero-copy implementation.  The zero-copy implementation has the full
generality I posted first -- buffer_heads, buffer_lists, and buffers.
It has a lot of nice optimizations in it, but the API is a little more
general than it needs to be.  At any rate the zero-copy version breaks
even compared to the one-copy version.

Maybe something to remember which might help convince folks.  With present
100baseT hardware, your kernel is going to make one-copy of all your data
regardless -- because it has to assemble TCP packets to send off to the
network card.  If you've already done one-copy just before entering the
kernel there's a high chance that the entire 4k packet is still sitting in
your L1 data cache when the kernel needs it.  Optimistically it'll take
the kernel, say 200 32-bit operations to copy that 4k data into network
packets... that's 200 cycles, or 0.5us on a 400MHz processor.  Worst case
scenario is that all the data is in the L2, and the L2 is say 10 cycles
away.  Then your cost is 5.5us... which is above the one-copy cost you had
to pay anyhow.
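
Spelling out that arithmetic (same round numbers as above -- the 200 ops
figure is a guess, not a measurement):

```c
/* Cost in microseconds to issue `ops` copy operations on a 400MHz
 * processor, when each op stalls `stall_cycles` extra cycles waiting
 * on the cache (0 for L1-resident data, ~10 for L2). */
double copy_cost_us(int ops, int stall_cycles)
{
    double cycles = ops * (1.0 + stall_cycles);
    return cycles / 400.0;   /* 400 cycles per microsecond at 400MHz */
}
```

copy_cost_us(200, 0) gives the 0.5us L1 figure, copy_cost_us(200, 10)
the 5.5us L2 figure.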

OK ok, so there is gigabit ethernet and ATM hardware which can do TCP
packet assembly.  And suppose we care about it in the apache 2.0 timeframe
(as opposed to a 2.1 or later timeframe).  In Solaris 7 they implemented
true zero-copy, but it only worked on the page-aligned data that was going
from disk to the ATM card, the rest was one-copy (for assembly).  We
support this -- this is what large_write() with its writev() usage is
intended to support (actually it doesn't work with Solaris 7 on the first
32k of the file, but the Sun engineer told me he was thinking about how to
support the writev() we use).  I suspect that other folks doing true
zero-copy are going to have similar restrictions -- disk -> net optimized,
memory -> net unoptimized... and we're back to that 5.5us cost. 

Let's just say I remain unconvinced.  I think our profile will have bigger
fish to fry than this. 

