httpd-dev mailing list archives

From Dean Gaudet
Subject Re: filter API spec
Date Wed, 03 Sep 1997 00:13:03 GMT
On Tue, 2 Sep 1997, Alexei Kosut wrote:

> On Tue, 2 Sep 1997, Dean Gaudet wrote:
> > I haven't read this completely yet, but it looks like you're missing
> > writev() -- it's needed for performance (avoiding user-to-user copies).
> > Without it you can't implement chunking as a filter as efficiently as I
> > presently have.
> But BUFF doesn't provide writev now, except internally to how it does
> chunking. And as I understand writev, you couldn't use it if there are
> filters stacked below chunking, and if there weren't, we'd use the
> existing approach (which does use writev).

There is no reason to provide bwritev() at the moment, because nothing at
the application level above BUFF uses it.  Nothing up there typically
needs to use it ... callers either need fully buffered output
(e.g. the headers, mod_include, mod_autoindex) or completely
unbuffered output (e.g. default_handler, or mod_cgi).  The former
necessarily involves user-space memory copies because the output is being
generated on the fly.  The latter two (default_handler and mod_cgi) involve
mass output of things already in memory somewhere ... and it's those two
cases which cause the writev() code in buff.c to go to work.

Consider just the lowest level of your filter stack, i.e. what is
currently implemented in buff.c.  You have two goals on output -- reduce
the number of write() system calls (because system calls are expensive,
and have a tendency to define packet boundaries on the network), and
reduce the amount of time you spend copying one piece of memory to
another piece of memory (e.g. to stuff more into the buffer).  The current
buff.c code uses heuristics to work on both these goals.  writev() is the
crucial piece that makes both of these goals attainable -- when a writev()
is used you have a single system call, and you haven't copied memory.

There are two ways that writev() is used, one related to chunking.  When a
"large" bwrite occurs that would overflow the buffer, the buff code takes
that opportunity to form a two-element (or, if chunking, four-element)
writev().  The first element is the current contents of the buffer.  If
not chunking, the second is the array passed to bwrite().  If chunking, the
second is a chunk header, the third is the array, and the fourth is a
chunk trailer.  Note that in this case no memory-to-memory copying has
been done.
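
To make the four-element case concrete, here is a minimal sketch.  It is
not the buff.c code: sketch_write_chunk() and its parameters are names
made up for illustration, and it assumes the usual HTTP/1.1 chunk framing
(hex size plus CRLF as the header, CRLF as the trailer).  The only bytes
formatted in user space are the few in the chunk header.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from buff.c: send the pending buffer
 * contents and a large caller-supplied block as one chunk, using a single
 * writev().  The payload itself is never copied in user space.
 * Returns the byte count reported by writev(), or -1 on error.
 */
ssize_t sketch_write_chunk(int fd, const char *pending, size_t pending_len,
                           const char *data, size_t data_len)
{
    char header[32];
    struct iovec vec[4];
    int n;

    assert(data_len > 0);   /* a zero-length chunk would terminate the body */

    /* Chunk header: payload size in hex, followed by CRLF. */
    n = snprintf(header, sizeof(header), "%lx\r\n", (unsigned long)data_len);

    vec[0].iov_base = (void *)pending;   /* current contents of the buffer */
    vec[0].iov_len  = pending_len;
    vec[1].iov_base = header;            /* chunk header */
    vec[1].iov_len  = (size_t)n;
    vec[2].iov_base = (void *)data;      /* the array passed by the caller */
    vec[2].iov_len  = data_len;
    vec[3].iov_base = (void *)"\r\n";    /* chunk trailer */
    vec[3].iov_len  = 2;

    /* A real implementation must also cope with short writes and EINTR. */
    return writev(fd, vec, 4);
}
```

Dropping vec[1] and vec[3] gives the two-element, non-chunked form.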

writev() is also used when a naked chunk has to be written and the buffer
is empty, or when doing unbuffered chunked output.

I want to see chunking implemented as a filter; it would vastly simplify
the buff.c code.  But with your current proposal we cannot do this without
either unnecessary memory copying or extra system calls.

Something to consider ... Apache versions before 1.3 have the following
behaviour for each byte output by send_fd (used by the default_handler and
mod_cgi):

 -> paged into the kernel/copied to pipe buffer in kernel
 -> copied to user buffer (via read())
 -> copied to another user buffer (in the BUFF)
 -> copied to a kernel network buffer (via write())

and 1.3, for large outputs (4k+), default_handler:

 -> paged via mmap()
 -> write() or writev() causes copy to network buffer

and cgi:

 -> copied to a pipe buffer in kernel
 -> copied to a user buffer (via read())
 -> write() or writev() causes copy to network buffer

Note that the default_handler is almost a zero-copy system when it's using
mmap().  In fact, under Solaris 2.6 with Sun's ATM card (which does
scatter-gather, checksum calculations and so on), the default_handler
case is zero copy.  The kernel directs the disk to read in pages (mmap),
then instructs the ATM card to construct packets out of those pages; the
CPU never has to touch a single byte.  It's really easy to maintain
wire-speed transfers under load in this config.
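
For reference, the mmap() path looks roughly like the sketch below.  This
is an illustration only, not the default_handler code: sketch_send_file()
is a made-up name, and whether the write() really avoids a CPU copy
depends on the kernel and hardware as described above.  The point is just
that no user-space buffer sits between the file and the socket.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a file and hand the mapped pages straight to
 * write().  Returns the number of bytes written, or -1 on error.
 */
ssize_t sketch_send_file(const char *path, int out_fd)
{
    struct stat st;
    char *map;
    size_t len, off = 0;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    len = (size_t)st.st_size;
    if (len == 0) {             /* mmap() rejects zero-length mappings */
        close(fd);
        return 0;
    }

    map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    if (map == MAP_FAILED)
        return -1;

    /* write() may be short; loop until every mapped byte is handed over. */
    while (off < len) {
        ssize_t n = write(out_fd, map + off, len - off);
        if (n < 0) {            /* EINTR handling omitted for brevity */
            munmap(map, len);
            return -1;
        }
        off += (size_t)n;
    }

    munmap(map, len);
    return (ssize_t)off;
}
```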

The CGI case could be made faster on some systems that support "page
flipping".  If we were to page-align (usually 4k or 8k) our buffers, and
CGIs also aligned their buffers, then the kernel could use copy-on-write
to avoid a copy of the data into the pipe buffer ... and instead temporarily
"merge" the two pages in each process.  I didn't bother with this
because it requires CGIs to be written expecting it (they typically need
to have two page-sized, page-aligned buffers and alternate writing each
so that the copy-on-write doesn't take effect, and presumably their
time slice will be up before they've filled both buffers).  But it's
something to remember down the road ... fastcgi for example might be able
to make use of it.
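
On the CGI side, the buffer discipline described above might look like
the following sketch.  Everything here is hypothetical (the
sketch_pagebufs names are made up), and it only shows the two
page-sized, page-aligned buffers and the alternation between them;
whether a given kernel actually flips pages instead of copying is outside
the program's control.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical pair of page-sized, page-aligned output buffers. */
struct sketch_pagebufs {
    char  *buf[2];
    size_t page;   /* system page size in bytes */
    int    next;   /* index of the buffer to hand out next */
};

/* Allocate both buffers on page boundaries.  Returns 0, or -1 on failure. */
int sketch_pagebufs_init(struct sketch_pagebufs *pb)
{
    long page = sysconf(_SC_PAGESIZE);
    void *a = NULL, *b = NULL;

    assert(pb != NULL);
    if (page <= 0)
        return -1;
    if (posix_memalign(&a, (size_t)page, (size_t)page) != 0)
        return -1;
    if (posix_memalign(&b, (size_t)page, (size_t)page) != 0) {
        free(a);
        return -1;
    }
    assert((uintptr_t)a % (uintptr_t)page == 0);

    pb->buf[0] = a;
    pb->buf[1] = b;
    pb->page = (size_t)page;
    pb->next = 0;
    return 0;
}

/*
 * Hand out the buffers alternately, so the program fills one page while
 * the page it just wrote may still be shared with the reader.
 */
char *sketch_pagebufs_take(struct sketch_pagebufs *pb)
{
    char *b = pb->buf[pb->next];
    pb->next = !pb->next;
    return b;
}
```

A CGI would fill the buffer returned by sketch_pagebufs_take(), write()
exactly pb.page bytes from it, and then take the other buffer for the
next block.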

> > Multiple layers of buffers is dangerous, but this is really an
> > implementation detail in each filter.  Filters should be able to
> > completely disable buffering. 
> I suppose - which way, though? Do they disable buffering in their own
> BUFF, or the one they sit on top of? Why are multiple layers of buffers
> dangerous, anyhow?

OK, dangerous in two ways ... the first is performance, the second is in
the actual implementation of the protocol.  The second is dealt with
by the flush function ... but I have a suspicion that we may need to
differentiate between a flush that has to go all the way down the stack
and a flush that needs only to go one layer.  The first is really just
a matter of implementation -- filters could be written really poorly
(e.g. inducing a lot of user space memory copying because they work
character by character rather than block by block).
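
The difference between the two kinds of flush can be sketched as follows.
This is not the proposed filter API: struct sketch_layer and the
sketch_* functions are made up for illustration, and the layers copy
into each other's buffers only to keep the example short (which is
exactly the copying a real stack should avoid).

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BUFSZ 4096

/* Hypothetical filter layer with its own buffer. */
struct sketch_layer {
    struct sketch_layer *below;   /* NULL for the lowest layer */
    int    fd;                    /* used only by the lowest layer */
    char   buf[SKETCH_BUFSZ];
    size_t used;
};

/* Buffer bytes in this layer.  Returns -1 if they do not fit (a real
 * layer would flush and retry instead of failing). */
int sketch_put(struct sketch_layer *l, const char *data, size_t len)
{
    assert(l != NULL);
    if (len > SKETCH_BUFSZ - l->used)
        return -1;
    memcpy(l->buf + l->used, data, len);
    l->used += len;
    return 0;
}

/* One-layer flush: push this layer's bytes to the layer below, or to the
 * file descriptor if this is the lowest layer. */
int sketch_flush_one(struct sketch_layer *l)
{
    if (l->used == 0)
        return 0;
    if (l->below != NULL) {
        if (sketch_put(l->below, l->buf, l->used) < 0)
            return -1;
    } else {
        size_t off = 0;
        while (off < l->used) {
            ssize_t n = write(l->fd, l->buf + off, l->used - off);
            if (n < 0)
                return -1;
            off += (size_t)n;
        }
    }
    l->used = 0;
    return 0;
}

/* Full flush: flush every layer from here down to the descriptor. */
int sketch_flush_all(struct sketch_layer *l)
{
    for (; l != NULL; l = l->below)
        if (sketch_flush_one(l) < 0)
            return -1;
    return 0;
}
```

After sketch_flush_one() on the top layer the bytes sit in the layer
below and nothing has reached the network; only sketch_flush_all()
guarantees a write().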

It would just totally suck if the "path" for a byte during chunking were
to be first put into a buffer at the application level, say via rprintf,
then copied into a buffer for the chunking filter, then copied
(via the write method) to the lowest layer buffer, and finally given
to the operating system via write() (which would then bust it up into
network packets involving another copy).  Contrast that with the current
code, where that copy in the middle doesn't happen -- the rprintf buffer
is copied once (or maybe even not at all -- writev) into the
final buffer that is later sent to the operating system.

I think you've got this part almost right; my only concern right now is
that we may need two forms of flush ... I'll think about it some more.

