httpd-cvs mailing list archives

Subject cvs commit: apache-2.0/src/lib/apr/buckets doc_stacked_io.txt
Date Thu, 13 Jul 2000 05:15:41 GMT
fielding    00/07/12 22:15:40

  Added:       src/lib/apr/buckets doc_stacked_io.txt
  Early design notes from Ed and Alexei [and Dean]
  Submitted by:	Ed Korthof, Alexei Kosut, Dean Gaudet
  Revision  Changes    Path
  1.1                  apache-2.0/src/lib/apr/buckets/doc_stacked_io.txt
  Index: doc_stacked_io.txt
  [djg: comments like this are from dean]
  This past summer, Alexei and I wrote a spec for an I/O Filters API... 
  this proposal addresses one part of that -- 'stacked' I/O with buff.c. 
  We have a couple of options for stacked I/O: we can either use existing
  code, such as sfio, or we can rewrite buff.c to do it.  We've gone over
  the first possibility at length, though, and there were problems with each
  implementation which was mentioned (licensing and compatibility,
  specifically); so far as I know, those remain issues. 
  Btw -- sfio will be supported w/in this model... it just wouldn't be the
  basis for the model's implementation. 
       -- Ed Korthof        |  Web Server Engineer --
       --    |  Organic Online, Inc --
       -- (415) 278-5676    |  Fax: (415) 284-6891 --
  Stacked I/O With BUFFs
  	1.) Overview
  	2.) The API
  		User-supplied structures
  		API functions
  	3.) Detailed Description
  		The bfilter structure
  		The bbottomfilter structure
  		The BUFF structure
  		Public functions in buff.c
  	4.) Efficiency Considerations
  		Memory copies
  		Function chaining
  	5.) Code in buff.c
  		Default Functions
  		Heuristics for writev
  		Flushing data
  		Closing stacks and filters
  		Flags and Options
  The intention of this API is to make Apache's BUFF structure modular
  while retaining high efficiency.  Basically, it involves rewriting
  buff.c to provide 'stacked' I/O -- where the data passed through a
  series of 'filters', which may modify it.
  There are two parts to this, the core code for BUFF structures, and the
  "filters" used to implement new behavior.  "filter" is used to refer to
  both the sets of six functions, as shown in the bfilter structure in the
  next section, and to BUFFs which are created using a specific bfilter.
  These will also be occasionally referred to as "user-supplied", though
  the Apache core will need to use these as well for basic functions.
  The user-supplied functions should use only the public BUFF API, rather
  than any internal details or functions.  One thing which may not be
  clear is that in the core BUFF functions, the BUFF pointer passed in
  refers to the BUFF on which the operation will happen.  OTOH, in the
  user-supplied code, the BUFF passed in is the next buffer down the
  chain, not the current one.
  		The API
  	User-supplied structures
  First, the bfilter structure is used in all filters:
      typedef struct bfilter {
        int (*writev)(BUFF *, void *, struct iovec *, int);
        int (*read)(BUFF *, void *, char *, int);
        int (*write)(BUFF *, void *, const char *, int);
        int (*flush)(BUFF *, void *, const char *, int, struct bfilter *);
        int (*transmitfile)(BUFF *, void *, file_info_ptr *);
        void (*close)(BUFF *, void *);
      } bfilter;
  bfilters are placed into a BUFF structure along with a
  user-supplied void * pointer.
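  As a rough illustration of how a filter and its void * pointer fit
  together, here is a minimal sketch of an upper-casing write filter
  stacked on a memory sink.  The trimmed-down BUFF and bfilter types, the
  field names (filter, ctx, next), and the sink are all assumptions made
  for this example; they are not the structures proposed above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins: the real BUFF and bfilter live in buff.c. */
typedef struct BUFF BUFF;
typedef struct bfilter {
    int (*write)(BUFF *next, void *ctx, const char *buf, int nbyte);
} bfilter;

struct BUFF {
    bfilter filter;   /* functions supplied by the filter author */
    void *ctx;        /* the user-supplied void * pointer */
    BUFF *next;       /* next BUFF down the stack */
};

/* Bottom filter for the demo: appends to a fixed-size memory sink. */
typedef struct { char data[64]; int len; } sink;

static int sink_write(BUFF *next, void *ctx, const char *buf, int nbyte)
{
    sink *s = ctx;
    (void)next;                       /* nothing below the bottom filter */
    if (nbyte > (int)sizeof s->data - s->len)
        nbyte = (int)sizeof s->data - s->len;
    memcpy(s->data + s->len, buf, (size_t)nbyte);
    s->len += nbyte;
    return nbyte;
}

/* Upper-casing filter: copies because the input is const, then passes
 * the result to the BUFF it was handed (the next one down). */
static int upper_write(BUFF *next, void *ctx, const char *buf, int nbyte)
{
    char tmp[64];
    int i;
    (void)ctx;                        /* this filter keeps no state */
    assert(next != NULL && next->filter.write != NULL);
    if (nbyte > (int)sizeof tmp)
        nbyte = (int)sizeof tmp;      /* process only what fits */
    for (i = 0; i < nbyte; i++)
        tmp[i] = (char)toupper((unsigned char)buf[i]);
    return next->filter.write(next->next, next->ctx, tmp, nbyte);
}
```

  Note that upper_write never touches its own BUFF: it only sees the next
  one down, which matches the calling convention described in the overview.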
  Second, the following structure is for use with a filter which can
  sit at the bottom of the stack:
      typedef struct bbottomfilter {
        void *(*bgetfileinfo)(BUFF *, void *);
        void (*bpushfileinfo)(BUFF *, void *, void *);
      } bbottomfilter;
  	BUFF API functions
  The following functions are new BUFF API functions:
  For filters:
  BUFF * bcreatestack(pool *p, int flags, struct bfilter *,
                      struct bbottomfilter *, void *);
  BUFF * bpushfilter (BUFF *, struct bfilter *, void *);
  BUFF * bpushbuffer (BUFF *, BUFF *);
  BUFF * bpopfilter(BUFF *);
  BUFF * bpopbuffer(BUFF *);
  void bclosestack(BUFF *);
  For BUFFs in general:
  int btransmitfile(BUFF *, file_info_ptr *);
  int bsetstackopts(BUFF *, int, const void *);
  int bsetstackflags(BUFF *, int, int);
  Note that a new flag is needed for bsetstackflags: B_MAXBUFFERING.
  The current bcreate should become
  BUFF * bcreatebuffer (pool *p, int flags, struct bfilter *, void *);
  		Detailed Explanation
  	bfilter structure
  The void * pointer used in all these functions, as well as those in the
  bbottomfilter structure and the filter API functions, is always the same
  pointer w/in an individual BUFF.
  The first function in a bfilter structure is 'writev'; this is only
  needed for high efficiency writing, generally at the level of the system
  interface.  In its absence, multiple writes will be done w/ 'write'.
  Note that defining 'writev' means you must define 'write'.
  The second is 'write' (listed after 'read' in the structure above); this
  is the generic writing function, taking a BUFF * to which to write, a
  block of text, and the length of that block of text.  The expected return
  is the number of characters (out of that block of text) which were
  successfully processed (rather than the number of characters actually
  written).
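  The distinction between "processed" and "written" matters for any
  filter that shrinks its input.  The following sketch, which assumes a
  hypothetical emit callback in place of the next filter's write, strips
  carriage returns and still reports every input byte as processed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical output callback standing in for the next filter's write. */
typedef int (*emit_fn)(void *ctx, const char *buf, int nbyte);

/* Strips carriage returns.  Returns how many INPUT bytes were consumed,
 * which can exceed the number of bytes actually emitted, or -1 on error. */
static int strip_cr_write(emit_fn emit, void *ctx, const char *buf, int nbyte)
{
    char out[256];
    int i, n = 0;
    assert(emit != NULL);
    if (nbyte > (int)sizeof out)
        nbyte = (int)sizeof out;       /* consume only what fits */
    for (i = 0; i < nbyte; i++)
        if (buf[i] != '\r')
            out[n++] = buf[i];
    if (n > 0 && emit(ctx, out, n) != n)
        return -1;                     /* downstream failed or was short */
    return nbyte;                      /* input bytes processed */
}

/* Demo sink that only counts what it receives. */
static int count_emit(void *ctx, const char *buf, int nbyte)
{
    (void)buf;
    *(int *)ctx += nbyte;
    return nbyte;
}
```

  A caller that compared the return value against the bytes seen by the
  sink would wrongly conclude that data had been lost; comparing it against
  the length it passed in is the intended check.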
  The third is 'read' (listed before 'write' in the structure above); this
  is the generic reading function, taking a BUFF * from which to read data,
  and a void * buffer in which to put text, and the number of characters to
  put in that buffer.  The expected return is the number of characters
  placed in the buffer.
  The fourth is 'flush'; this is intended to force the buffer to spit out
  any data it may have been saving, as well as to clear any data the
  BUFF code was storing.  If the third argument is non-null, then it
  contains more text to be printed; that text need not be null terminated,
  but the fourth argument contains the length of text to be processed.  The
  expected return value should be the number of characters handled out
  from the third argument (0 if there are none), or -1 on error.  Finally,
  the fifth argument is a pointer to the bfilter struct containing this
  function, so that it may use the write or writev functions in it.   Note
  that general buffering is handled by BUFF's internal code, and module
  writers should not store data for performance reasons.
  The fifth is 'transmitfile', which takes as its arguments a buffer to
  which to write (if non-null), the void * pointer containing configuration
  (or other) information for this filter, and a system-dependent pointer
  (the file_info_ptr structure will be defined on a per-system basis)
  containing information required to print the 'file' in question.
  This is intended to allow zero-copy TCP in Win32.
  The sixth is 'close'; this is what is called when the connection is being
  closed.  The 'close' should not be passed on to the next filter in the
  stack.  Most filters will not need to use this, but if database handles
  or some other object is created, this is the point at which to remove it.
  Note that flush is called automatically before this.
  	bbottomfilter Structure
  The first function, bgetfileinfo, is designed to allow Apache to get
  information from a BUFF struct regarding the input and output sources.
  This is currently used to get the input file number to select on a
  socket to see if there's data waiting to be read.  The information
  returned is platform specific; the void * pointer passed in holds
  the void * pointer passed to all user-supplied functions.
  The second function, bpushfileinfo, is used to push file information
  onto a buffer, so that the buffer can be fully constructed and ready
  to handle data as soon as possible after a client has connected.
  The first void * pointer holds platform specific information (in
  Unix, it would be a pair of file descriptors); the second holds the
  void * pointer passed to all user-supplied functions.
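  For Unix, the platform specific information could be as small as a pair
  of descriptors.  The sketch below shows one way the two hooks might
  store and return that pair; the unix_fileinfo and bottom_ctx types and
  both function names are invented for illustration, and the BUFF *
  argument from the proposal is left out because the sketch does not
  use it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical Unix layout for the platform-specific file information:
 * one descriptor for reading, one for writing. */
typedef struct {
    int fd_in;
    int fd_out;
} unix_fileinfo;

/* Per-BUFF state of a hypothetical bottom filter (the user-supplied
 * void * pointer). */
typedef struct {
    unix_fileinfo info;
} bottom_ctx;

/* bpushfileinfo-style hook: store the descriptors in the filter state. */
static void unix_pushfileinfo(void *fileinfo, void *ctx)
{
    bottom_ctx *b = ctx;
    assert(fileinfo != NULL && ctx != NULL);
    b->info = *(const unix_fileinfo *)fileinfo;
}

/* bgetfileinfo-style hook: hand the descriptors back, e.g. so the
 * caller can select() on fd_in. */
static void *unix_getfileinfo(void *ctx)
{
    bottom_ctx *b = ctx;
    assert(ctx != NULL);
    return &b->info;
}
```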
  [djg: I don't think I really agree with the distinction here between
  the bottom and the other filters.  Take the select() example, it's
  valid for any layer to define a fd that can be used for select...
  in fact it's the topmost layer that should really get to make this
  definition.  Or maybe I just have your top and bottom flipped.  In
  any event I think this should be part of the filter structure and
  not separate.]
  	The BUFF structure
  A couple of changes are needed for this structure: remove fd and
  fd_in; add a bfilter structure; add a pointer to a bbottomfilter;
  add three pointers to the next BUFFs: one for the next BUFF in the
  stack, one for the next BUFF which implements write, and one
  for the next BUFF which implements read.
  	Public functions in buff.c
  BUFF * bpushfilter (BUFF *, struct bfilter *, void *);
  This function adds the filter functions from bfilter, stacking them on
  top of the BUFF.  It returns the new top BUFF, or NULL on error.
  BUFF * bpushbuffer (BUFF *, BUFF *);
  This function places the second buffer on the top of the stack that
  the first one is on.  It returns the new top BUFF, or NULL on error.
  BUFF * bpopfilter(BUFF *);
  BUFF * bpopbuffer(BUFF *);
  Detaches the top-most filter from the stack, and returns the new
  top-level BUFF, or NULL on error or when there are no BUFFs
  remaining.  The two are synonymous.
  void bclosestack(BUFF *);
  Closes the I/O stack, removing all the filters in it.
  BUFF * bcreatestack(pool *p, int flags, struct bfilter *,
                      struct bbottomfilter *, void *);
  This creates an I/O stack.  It returns NULL on error.
  BUFF * bcreatebuffer(pool *p, int flags, struct bfilter *, void *);
  This creates a BUFF for later use with bpushbuffer.  The BUFF is
  not set up to be used as an I/O stack, however.  It returns NULL
  on error.
  int bsetstackopts(BUFF *, int, const void *);
  int bsetstackflags(BUFF *, int, int);
  These functions set options and flags, respectively, on all the BUFFs
  in a stack.  The new flag, B_MAXBUFFERING, is used to disable a feature
  described in the next section, whereby only the first and last
  BUFFs will buffer data.
  		Efficiency Considerations
  All input and output is buffered by the standard buffering code.
  People writing code to use buff.c should not concern themselves with
  buffering for efficiency, and should not buffer except when necessary.
  The write function will typically be called with large blocks of text;
  the read function will attempt to place the specified number of bytes
  into the buffer.
  Dean noted that there are possible problems w/ multiple buffers;
  further, some applications must not be buffered.  This can be
  partially dealt with by turning off buffering, or by flushing the
  data when appropriate.
  However, some potential problems arise anyway.  The simplest example
  involves shrinking transformations; suppose that you have a set
  of filters, A, B, and C, such that A outputs less text than it
  receives, as does B (say A strips comments, and B gzips the result).
  Then after a write to A which fills the buffer, A writes to B.
  However, A won't write enough to fill B's buffer, so a memory copy
  will be needed.  This continues till B's buffer fills up, then
  B will write to C's buffer -- with the same effect.
  [djg: I don't think this is the issue I was really worried about --
  in the case of shrinking transformations you are already doing 
  non-trivial amounts of CPU activity with the data, and there's
  no copying of data that you can eliminate anyway.  I do recognize
  that there are non-CPU intensive filters -- such as DMA-capable
  hardware crypto cards.  I don't think they're hard to support in
  a zero-copy manner though.]
  The maximum additional number of bytes which will be copied in this
  scenario is on the order of nk, where n is the total number of bytes,
  and k is the number of filters doing shrinking transformations.
  There are several possible solutions to this issue.  The first
  is to turn off buffering in all but the first filter and the
  last filter.  This reduces the number of unnecessary byte copies
  to at most one per byte; however, it means that the functions in
  the stack will get called more frequently.  This is the default
  behavior, overridable by setting the B_MAXBUFFERING flag with
  bsetstackflags.  Most filters won't involve a net shrinking
  transformation, so even this will rarely be an issue; however,
  if the filters do involve a net shrinking transformation, for
  the sake of network-efficiency (sending reasonably sized blocks),
  it may be more efficient anyway.
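  The default policy just described can be stated as a one-line
  predicate.  This is only a sketch: the numeric value of B_MAXBUFFERING
  and the idea of addressing BUFFs by index are assumptions for the
  example, not part of the proposal.

```c
#include <assert.h>

/* Hypothetical flag value; the proposal names B_MAXBUFFERING but does
 * not assign it a number. */
#define B_MAXBUFFERING 0x1000

/* Returns 1 if the BUFF at position `index` (0 = top) in a stack of
 * `depth` BUFFs should buffer, 0 otherwise. */
static int should_buffer(int index, int depth, int stack_flags)
{
    assert(depth > 0 && index >= 0 && index < depth);
    if (stack_flags & B_MAXBUFFERING)
        return 1;                              /* every layer buffers */
    return index == 0 || index == depth - 1;   /* first and last only */
}
```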
  A second solution is more general use of writev for communication
  between different buffers.  This complicates the programming work,
  	Memory copies
  Each write function is passed a pointer to constant text; if any changes
  are being made to the text, it must be copied.  However, if no changes
  are made to the text (or to some smaller part of it), then it may be
  sent to the next filter without any additional copying.  This should
  provide the minimal necessary memory copies.
  [djg: Unfortunately this makes it hard to support page-flipping and
  async i/o because you don't have any reference counts on the data.
  But I go into a little detail that already in docs/page_io.]
  	Function chaining
  In order to avoid unnecessary function chaining for reads and writes,
  when a filter is pushed onto the stack, the buff.c code will determine
  which is the next BUFF which contains a read or write function, and
  reads and writes, respectively, will go directly to that BUFF.
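  One way the shortcut pointers could be computed at push time is shown
  below.  The has_read and has_write fields and the push_buff name are
  placeholders for this sketch; the real code would test the function
  pointers in the bfilter structure instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down types for illustration only. */
typedef struct BUFF BUFF;
struct BUFF {
    int has_read;        /* nonzero if this filter supplies read */
    int has_write;       /* nonzero if this filter supplies write */
    BUFF *next;          /* next BUFF in the stack */
    BUFF *next_reader;   /* nearest BUFF below that implements read */
    BUFF *next_writer;   /* nearest BUFF below that implements write */
};

/* Stacks `top` on `below` and caches the shortcut pointers so reads and
 * writes can skip layers that do not implement them. */
static BUFF *push_buff(BUFF *below, BUFF *top)
{
    assert(top != NULL);
    top->next = below;
    top->next_reader = NULL;
    top->next_writer = NULL;
    if (below != NULL) {
        top->next_reader = below->has_read ? below : below->next_reader;
        top->next_writer = below->has_write ? below : below->next_writer;
    }
    return top;
}
```

  Because each BUFF reuses the pointers already cached by the one below
  it, a push costs constant time regardless of stack depth.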
  writev is a function for efficient writing to the system; in terms of
  this API, however, it also works for dealing with multiple blocks of
  text without doing unnecessary byte copies.  It is not required.
  Currently, the system level writev is used in two contexts: for
  chunking and when a block of text is written which, combined with
  the text already in the buffer, would make the buffer overflow.
  writev would be implemented both by the default bottom level filter
  and by the chunking filter for these operations.  In addition, writev
  may be used, as noted above, to pass multiple blocks of text w/o
  copying them into a single buffer.  Note that if the next filter does
  not implement writev, however, this will be equivalent to repeated
  calls to write, which may or may not be more efficient.  Up to
  IOV_MAX-2 blocks of text may be passed along in this manner.  Unlike
  the system writev call, the writev in this API should be called only
  once, with an array of iovecs and a count of the number of
  iovecs in it.
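  To make the chunking case concrete, here is a sketch of framing one
  HTTP/1.1 chunk with the POSIX writev call, so the payload is not copied
  next to its size line.  The write_chunk name is made up for this
  example, and treating a short write as an error is a simplification; a
  real filter would retry, along the lines of writev_it_all.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sends one HTTP/1.1 chunk (size line, payload, CRLF) with a single
 * writev call.  Returns the payload length on success, -1 on error or
 * short write. */
static int write_chunk(int fd, const char *data, size_t len)
{
    char head[32];
    struct iovec vec[3];
    int hlen = snprintf(head, sizeof head, "%lx\r\n", (unsigned long)len);
    ssize_t want, got;

    assert(hlen > 0 && hlen < (int)sizeof head);
    vec[0].iov_base = head;
    vec[0].iov_len = (size_t)hlen;
    vec[1].iov_base = (void *)data;       /* writev does not modify it */
    vec[1].iov_len = len;
    vec[2].iov_base = (void *)"\r\n";
    vec[2].iov_len = 2;

    want = (ssize_t)((size_t)hlen + len + 2);
    got = writev(fd, vec, 3);
    return got == want ? (int)len : -1;
}
```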
  If a bfilter defines writev, writev will be called whether or not
  NO_WRITEV is set; hence, it should deal with that case in a reasonable
  manner.
  [djg: We can't guarantee atomicity of writev() when we emulate it.
  Probably not a problem, just an observation.]
  		Code in buff.c
  	Default Functions
  The default actions are generally those currently performed by Apache,
  save that they'll only attempt to write to a buffer, and they'll
  return an error if there are no more buffers.  That is, you must implement
  read, write, and flush in the bottom-most filter.
  Except for close(), the default code will simply pass the function call
  on to the next filter in the stack.  Some samples follow.
  	Heuristics for writev
  Currently, we call writev for chunking, and when we get enough data that
  the total overflows the buffer.  Since chunking is going to become a
  filter, the chunking filter will use writev; in addition, bwrite will
  trigger bwritev as shown (note that system specific information should
  be kept at the filter level):
  in bwrite:
      if (fb->outcnt > 0 && nbyte + fb->outcnt >= fb->bufsiz) {
          /* build iovec structs */
          struct iovec vec[2];
          vec[0].iov_base = (void *) fb->outbase;
          vec[0].iov_len = fb->outcnt;
          fb->outcnt = 0;
          vec[1].iov_base = (void *)buff;
          vec[1].iov_len = nbyte;
          return bwritev (fb, vec, 2);
      } else if (nbyte >= fb->bufsiz) {
          return write_with_errors(fb,buff,nbyte);
      }
  Note that the code above takes the place of large_write (as well
  as taking code from it).
  So, bwritev would look something like this (copying and pasting freely
  from the current source for writev_it_all, which could be replaced):
  int bwritev (BUFF * fb, struct iovec * vec, int nvecs) {
      if (!fb)
          return -1; /* the bottom level filter implemented neither write nor
                      * writev. */
      if (fb->bfilter.writev) {
          return fb->bfilter.writev(fb->next, vec, nvecs);
      } else if (fb->bfilter.write) {
          /* while it's nice and easy to build the vector and crud, it's
           * painful to deal with partial writes (esp. w/ the vector)
           */
          int i = 0,rv;
          while (i < nvecs) {
              do {
                  rv = fb->bfilter.write(fb, vec[i].iov_base, vec[i].iov_len);
              } while (rv == -1 && (errno == EINTR || errno == EAGAIN)
                       && !(fb->flags & B_EOUT));
              if (rv == -1) {
                  if (errno != EINTR && errno != EAGAIN) {
                      doerror (fb, B_WR);
                  }
                  return -1;
              }
              fb->bytes_sent += rv;
              /* recalculate vec to deal with partial writes */
              while (rv > 0) {
                  if (rv < vec[i].iov_len) {
                      vec[i].iov_base = (char *)vec[i].iov_base + rv;
                      vec[i].iov_len -= rv;
                      rv = 0;
                      if (vec[i].iov_len == 0) {
                          ++i;
                      }
                  } else {
                      rv -= vec[i].iov_len;
                      ++i;
                  }
              }
              if (fb->flags & B_EOUT)
                  return -1;
          }
          /* if we got here, we wrote it all */
          return 0;
      } else {
          return bwritev(fb->next,vec,nvecs);
      }
  }
  The default filter's writev function will pretty much like
  The general case for writing data is significantly simpler with this
  model.  Because special cases are not dealt with in the BUFF core,
  a single internal interface to writing data is possible; I'm going
  to assume it's reasonable to standardize on write_with_errors, but
  some other function may be more appropriate.
  In the revised bwrite (which I'll omit for brevity), the following
  must be done:
  	check for error conditions
  	check to see if any buffering is done; if not, send the data
  		directly to the write_with_errors function
  	check to see if we should use writev or write_with_errors
  		as above
  	copy the data to the buffer (we know it fits since we didn't
  		need writev or write_with_errors)
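  The order of those checks can be sketched as a small decision function.
  The enum, its member names, and choose_path are hypothetical; the
  thresholds are taken from the bwrite fragment shown earlier in this
  section.

```c
#include <assert.h>

/* Hypothetical names for the paths a revised bwrite could take. */
enum write_path {
    PATH_ERROR,      /* bad arguments or stack in an error state */
    PATH_DIRECT,     /* unbuffered: hand data to write_with_errors */
    PATH_WRITEV,     /* buffered data plus new data overflow: bwritev */
    PATH_LARGE,      /* empty buffer, block at least buffer-sized */
    PATH_COPY        /* block fits: copy it into the buffer */
};

/* Applies the checks in the order listed.  `outcnt` is the number of
 * bytes already buffered and `bufsiz` the buffer capacity. */
static enum write_path choose_path(int in_error, int buffered,
                                   int outcnt, int nbyte, int bufsiz)
{
    assert(bufsiz > 0 && outcnt >= 0);
    if (in_error || nbyte < 0)
        return PATH_ERROR;
    if (!buffered)
        return PATH_DIRECT;
    if (outcnt > 0 && nbyte + outcnt >= bufsiz)
        return PATH_WRITEV;
    if (nbyte >= bufsiz)
        return PATH_LARGE;
    return PATH_COPY;
}
```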
  The other work the current bwrite is doing is
  	ifdef'ing around NO_WRITEV
  	numerous decisions regarding whether or not to send chunks
  Generally, buff.c has a number of functions whose entire purpose is
  to handle particular special cases wrt chunking, all of which could
  be simplified with a chunking filter.
  write_with_errors would not need to change; buff_write would.  Here
  is a new version of it:
  /* the lowest level writing primitive */
  static ap_inline int buff_write(BUFF *fb, const void *buf, int nbyte)
  {
      if (fb->bfilter.write)
          return fb->bfilter.write(fb->next_writer,buf,nbyte);
      else
          return bwrite(fb->next_writer,buf,nbyte);
  }
  If the btransmitfile function is called on a buffer which doesn't implement
  it, the system will attempt to read data from the file identified
  by the file_info_ptr structure and use other methods to write to it.
  One of the basic reading functions in Apache 1.3b3 is buff_read;
  here is how it would look within this spec:
  /* the lowest level reading primitive */
  static ap_inline int buff_read(BUFF *fb, void *buf, int nbyte)
  {
      int rv;
      if (!fb)
          return -1; /* the bottom level filter is not set up properly */
      if (fb->bfilter.read)
          return fb->bfilter.read(fb->next_reader,buf,nbyte,fb->bfilter_info);
      else
          return bread(fb->next_reader,buf,nbyte);
  }
  The code currently in buff_read would become part of the default
  filter.
  	Flushing data
  flush will get passed on down the stack automatically, with recursive
  calls to bflush.  The user-supplied flush function will be called then,
  and also before close is called.  The user-supplied flush should not
  call flush on the next buffer.
  [djg: Poorly written "expanding" filters can cause some nastiness
  here.  In order to flush a layer you have to write out your current
  buffer, and that may cause the layer below to overflow a buffer and
  flush it.  If the filter is expanding then it may have to add more to
  the buffer before flushing it to the layer below.  It's possible that
  the layer below will end up having to flush twice.  It's a case where
  writev-like capabilities are useful.]
  	Closing Stacks and Filters
  When a filter is removed from the stack, flush will be called then close
  will be called.  When the entire stack is being closed, this operation
  will be done automatically on each filter within the stack; generally,
  filters should not operate on other filters further down the stack,
  except to pass data along when flush is called.
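  The flush-then-close ordering can be sketched as follows.  The
  simplified BUFF here models only flush and close, with signatures
  reduced to a single context pointer; pop_and_close, close_stack, and the
  tracing callbacks are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down types; only flush and close are modeled. */
typedef struct BUFF BUFF;
struct BUFF {
    void (*flush)(void *ctx);   /* may be NULL */
    void (*close)(void *ctx);   /* may be NULL */
    void *ctx;
    BUFF *next;
};

/* Detaches the top filter: flush first, then close.  Returns the new
 * top of the stack, or NULL when nothing remains. */
static BUFF *pop_and_close(BUFF *top)
{
    BUFF *below;
    assert(top != NULL);
    below = top->next;
    if (top->flush)
        top->flush(top->ctx);
    if (top->close)
        top->close(top->ctx);
    top->next = NULL;
    return below;
}

/* Closes every filter in the stack, top to bottom. */
static void close_stack(BUFF *top)
{
    while (top != NULL)
        top = pop_and_close(top);
}

/* Demo callbacks that append one letter per call to a trace string. */
static void trace_flush(void *ctx)
{
    char *p = ctx;
    size_t n = strlen(p);
    p[n] = 'f';
    p[n + 1] = '\0';
}

static void trace_close(void *ctx)
{
    char *p = ctx;
    size_t n = strlen(p);
    p[n] = 'c';
    p[n + 1] = '\0';
}
```

  A filter with no close function is simply skipped at that step, which
  fits the expectation that most filters will not need one.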
  	Flags and Options
  Changes to flags and options using the current functions only affect
  one buffer.  To affect all the buffers on down the chain, use
  bsetstackopts or bsetstackflags.
  bgetopt is currently only used to grab a count of the bytes sent;
  it will continue to provide that functionality.  bgetflags is
  used to provide information on whether or not the connection is
  still open; it'll continue to provide that functionality as well.
  The core BUFF operations will remain, though some operations which
  are done via flags and options will be done by attaching appropriate
  filters instead (eg. chunking).
  [djg: I'd like to consider filesystem metadata as well -- we only need
  a few bits of metadata to do HTTP: file size and last modified.  We
  need an etag generation function, it is specific to the filters in
  use.  You see, I'm envisioning a bottom layer which pulls data out of
  a database rather than reading from a file.]
