apr-dev mailing list archives

From Greg Stein <gst...@lyra.org>
Subject Re: [RFC] Network Abstraction Layer
Date Fri, 02 Mar 2001 10:06:17 GMT
Getting long here. Watch out! :-)


On Thu, Mar 01, 2001 at 08:17:10PM +0100, Elrond wrote:
> On Wed, Feb 28, 2001 at 07:40:21AM -0800, Greg Stein wrote:
>...
> > It might be interesting to examine the filters that we have in Apache right
> > now. They provide for the protocol-stacking, buffering, and (hey!)
> > filtering.
> 
> The filters are currently in Apache itself?

Yup.

> That might explain why I didn't find anything relating to
> filters in apr-util.
> 
> If filters actually are what we're looking for, it would be
> nice if the base of the filters (not the filters
> themselves) were moved into apr-util.
> This might be a good idea anyway.

As I stated elsewhere, there is actually not a lot of stuff related to
filters. Most of the code deals with registration of filters, rather than
their use. I'm not sure how much of it has broad utility.

In a nutshell, a filter function has this prototype:

  typedef apr_status_t (*ap_out_filter_func)(ap_filter_t *f,
                                             apr_bucket_brigade *bb);

ap_filter_t has a couple pieces of metadata, the filter function pointer,
and a pointer to the "next" filter. A filter function looks something like:

  static apr_status_t filter_func(ap_filter_t *f, apr_bucket_brigade *bb)
  {
      /* manipulate bb in interesting ways */

      return ap_pass_brigade(f->next, bb);
  }

Where ap_pass_brigade is essentially:

  {
      return f->func(f, bb);
  }

The meat of the filter code deals with naming the filters and then creating
the linked list of filters (we insert by name, and filters have an inherent
ordering).
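
For illustration, registration and by-name insertion look roughly like this
(a sketch using Apache 2.x-style calls; the exact signatures and filter-type
constants have shifted over time, and ssi_filter_func/r are just stand-ins):

  /* at module init: give the filter a name, a function, and an ordering */
  ap_register_output_filter("SSI", ssi_filter_func, NULL /* init */,
                            AP_FTYPE_CONTENT);

  /* per request: insert it by name into the output chain */
  ap_add_output_filter("SSI", NULL /* ctx */, r, r->connection);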

>...
> I've read the complete apr_buckets.h.
> From a _very_ high point of view, bucket brigades are a
> list of buckets, and buckets simply refer to static
> read-only data, which was (or will be) created.

This is correct. A brigade's contents are read-only, but may be generated
*during* a bucket's read() function (e.g. read the content from a pipe,
socket, or file).

"Manipulating" a brigade involves splitting buckets, inserting or removing
portions, or simply replacing portions. For example, let's say you are a
filter handling server-side-includes (SSI), and you get a brigade with a
single bucket with the string:

  BRIGADE
    \
    BUCKET
     "duration = <!--#var foo --> seconds"

The filter will split this into three buckets:

  BRIGADE
    \
    BUCKET ---------- BUCKET -------------- BUCKET
     "duration = "     "<!--#var foo -->"    " seconds"

It then replaces the middle bucket with a new bucket representing the
request values. Note that the first and last buckets do not copy the value;
they simply point to different parts of the original bucket's value.
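
In terms of the bucket API, that split-and-replace is roughly the following
sketch (modern apr-util names, which differ a bit from the originals;
"value" and "pool" stand in for the expanded variable and the request pool):

  apr_bucket *b = APR_BRIGADE_FIRST(bb);
  apr_bucket *tag;

  /* split after "duration = ", then isolate the SSI tag itself */
  apr_bucket_split(b, strlen("duration = "));
  tag = APR_BUCKET_NEXT(b);
  apr_bucket_split(tag, strlen("<!--#var foo -->"));

  /* replace the tag bucket with one pointing at the expanded value */
  APR_BUCKET_INSERT_BEFORE(tag,
      apr_bucket_pool_create(value, strlen(value), pool, bb->bucket_alloc));
  apr_bucket_delete(tag);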

> So this is basically a nice way of moving around data.

It is a way to do zero-copy of content (presuming that is feasible within
the constraints of the filters' manipulations; a gzip filter simply can't
retain the original data :-)

Since one of the bucket types represents a FILE, you can also inject a file
into the brigade and send that down the output stack. Let's say that nothing
in the chain happens to need the file contents. The file then arrives at the
bottom "filter" of the chain, and sendfile() is used to shove the thing out
onto the network.

This zero-copy is quite feasible when you consider a brigade transform such
as this:

  BRIGADE = { FILE, EOS }    # FILE bucket and EOS (End Of Stream) bucket

becomes

  BRIGADE = { "packet header bytes", FILE, EOS }

We inserted a header without touching the file. The output filter then issues
a sendfile() whose header iovec carries those header bytes. Blam! Out the
network it goes... :-)
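
A sketch of that transform, again in (roughly) apr-util terms, where "ba" is
the bucket allocator and HDR/file/file_len are hypothetical:

  apr_bucket_brigade *bb = apr_brigade_create(pool, ba);

  /* { FILE, EOS } */
  APR_BRIGADE_INSERT_TAIL(bb, apr_bucket_file_create(file, 0, file_len,
                                                     pool, ba));
  APR_BRIGADE_INSERT_TAIL(bb, apr_bucket_eos_create(ba));

  /* a later filter prepends the packet header without touching the file */
  APR_BRIGADE_INSERT_HEAD(bb, apr_bucket_immortal_create(HDR, sizeof(HDR) - 1,
                                                         ba));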

>...
> Some of our protocol layers might actually be doable as
> filters. (SMB itself quite likely, at least for the
> fileserving part; SMB does more than fileserving, but it
> probably resembles http in some ways)

Possibly doable. I merely bring it up as something for you to look at. The
brigades are a great way to move around content.

> The part where Sander's NAL comes in is the one filter
> that writes stuff to the network:
> 
> In Apache this is most likely a simple unix tcp socket,
> into which you write data.
> NAL tries to abstract sockets.

Right!

> The issue here is that the protocols below SMB
> are more like sockets. From a clean point of view, they
> should be implemented in the kernel, but none of us is
> willing to write kernel drivers for each OS out there.
> 
> If you look through Sander's proposal, nearly all typical
> Unix socket operations are represented as function pointers,
> and actually nearly each of them needs to be implemented,
> and most of them will actually do something complex.

Yep, I believe that I understood that part. Note that, in the filter stack
approach, the last "filter" is the thing that delivers to the network. Given
that, it is possible to define each of those custom "sockets" as one of
these end-point filters.

In Apache, that end-point is a plain socket, as you describe.

In Samba, it could be a filter that delivers to a kernel-level SMB socket.
Or it could be a user-level SMB "socket" that writes to a raw ethernet
device. Depending upon the capabilities of the platform, you insert the
appropriate end-point filter.

The "higher" filters are none-the-wiser... they just pass a brigade to the
next filter. And one of those "next filters" happens to be the last one, and
it happens to get the data onto the network.

> I see one correlation point to filters:
> 
> Maybe there should be an optional
> "append_output_filters_to_filterstack" and
> "prepend_input_filters_to_filterstack". That way, if the
> socket can be represented as filters, and the socket-writer
> has the energy to write those filters, that can be
> optimized. Functions are needed, because the actually used
> filters can be quite dynamic and could depend on the remote
> system that we connect to, or on whether someone actually
> implemented those sockets in the kernel.

I believe we are describing the same thing :-) ... We probably just have a
bit of terminology differences to come together on.

I think your function names are modelling what we do in Apache, so I'll
explain what we do, and you can see if that matches your thoughts.

In Apache, we have an "MPM" that listens at sockets waiting for connections
to arrive. When one arrives, the MPM creates a "request record" and begins
request processing. The first thing that occurs is that the MPM inserts one
filter onto the input stack, and one filter onto the output stack. The
former is an "input filter" (we've just discussed output filters above)
which knows how to populate a brigade with content from the connection (in
our case, we just put a SOCKET bucket into the brigade and we're done).
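
Schematically, that input filter amounts to little more than the following
(a sketch: the real input filter API takes extra mode/blocking arguments,
and "sock" is the connection's apr_socket_t):

  static apr_status_t core_input_filter(ap_filter_t *f, apr_bucket_brigade *bb)
  {
      /* hand downstream a single SOCKET bucket; its contents are read
       * lazily as consumers call apr_bucket_read() on it */
      APR_BRIGADE_INSERT_TAIL(bb,
          apr_bucket_socket_create(sock, bb->bucket_alloc));
      return APR_SUCCESS;
  }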

The output filter knows how to write a brigade to the socket. It has some
smarts to recognize FILE buckets and sendfile() them. It can use writev()
to write a bunch of plain buckets to the socket. Etc.
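
The dispatch inside that network filter is conceptually something like this
(much simplified: no batching, partial-write handling, or error checking;
"csock" is the connection's apr_socket_t):

  apr_bucket *b;

  for (b = APR_BRIGADE_FIRST(bb); b != APR_BRIGADE_SENTINEL(bb);
       b = APR_BUCKET_NEXT(b)) {
      if (APR_BUCKET_IS_FILE(b)) {
          apr_bucket_file *fb = b->data;
          apr_off_t off = b->start;
          apr_size_t flen = b->length;
          /* let the kernel copy the file straight onto the socket */
          apr_socket_sendfile(csock, fb->fd, NULL, &off, &flen, 0);
      }
      else {
          const char *data;
          apr_size_t dlen;
          apr_bucket_read(b, &data, &dlen, APR_BLOCK_READ);
          apr_socket_send(csock, data, &dlen);  /* or gather several into writev() */
      }
  }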


Now, let's say that during request processing we want to add a gzip filter
to compress the output, and an SSI filter to process the content. These
get added with a call such as:

  ap_add_output_filter("GZIP", ctx, request, connection)

Apache looks up the filter from the name, and inserts it into the linked
list. Same goes for the SSI filter.

The missing piece is that each filter has a self-described ordering. This
allows the SSI filter to see the content first, followed by GZIP, followed
by the network:

output-stack = SSI -> GZIP -> network

As content is fed into the stack, SSI processes it and passes it to "next".
This is the GZIP filter which compresses it and passes it to "next". The
next one is the network filter, which drops it into the socket.


I believe you used the term "append_output_filters_to_filterstack" as a
description of how the network-filter ends up at the end of the above linked
list. We use the self-described ordering to ensure this, since it is simply
too difficult to organize the *timing* of filter insertion in Apache (if all
filters were appended, then you'd have to ensure that SSI went first, then
GZIP was appended, then the network).

For Samba, you may have more control over the timing. Heck, you may not even
need the "linked-list of filters" concept if your output paths are selected
from a pretty rigid set.

> The fallback is more or less clear:
> append_output_filters_to_filterstack defaults to a simple
> filter that takes the whole brigade and write()s it into
> the socket.

Yes. Our first output filter was like that. Later on, we added more smarts,
so we could use sendfile, TCP_CORK, writev(), etc.

> prepend_input_filters_to_filterstack defaults to the
> opposite.
> Maybe we also might have an apr_bucket_NAL, which is a
> bucket, that reads its data from a NAL socket.

Absolutely! You've got it now :-)

In the Apache case, we simply add a plain old SOCKET bucket to the input
brigade. When the SOCKET finally "empties" itself and the input filter is
called for more, we return an APR_EOF status condition from the filter.

You could add a custom bucket type, or you could keep the complex code in
the filter. Both would work equally well, so it is more of a "design" issue
than a technical one.

Personally, I might choose to implement the NAL bucket so that it could be
used outside of the filter environment.
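
For comparison, the "complex code in the filter" variant is a sketch like
this (nal, nal_read(), and NAL_BUFSIZE are hypothetical NAL handles/calls;
a real NAL bucket type would push this logic into the bucket's read function):

  static apr_status_t nal_input_filter(ap_filter_t *f, apr_bucket_brigade *bb)
  {
      char *buf = apr_palloc(f->c->pool, NAL_BUFSIZE);
      apr_size_t len = NAL_BUFSIZE;
      apr_status_t rv = nal_read(nal, buf, &len);   /* hypothetical */

      if (rv == APR_EOF || len == 0)
          return APR_EOF;                /* the "socket" has emptied itself */

      APR_BRIGADE_INSERT_TAIL(bb,
          apr_bucket_pool_create(buf, len, f->c->pool, bb->bucket_alloc));
      return APR_SUCCESS;
  }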

> I'm not too sure which of the last ones is actually the
> right thing to do.
> 
> 
> The next interesting point is that NALs might have their
> own idea of how select()/poll() should be handled for
> them.

Right. We have the same issue in Apache, and our "MPM" mechanism is used to
solve this. Some MPMs are single-process and multi-thread, some are
multi-process, etc. Each one defines whether it will select/poll on a bunch
of sockets, or whether it has a thread-per-socket that just does an accept,
or whatever. We have custom MPMs for Windows, BeOS, and OS/2, and then two
or three styles for Unix-ish platforms. At the point where a connection and
request arrive, the different MPMs come back together into a unified
request-handling system.

> I hope that explains somewhat the intention behind NAL.

Absolutely! Like I said, we had a similar system at one point, so I saw
exactly where you were going. Here was the structure that we had:

struct ap_iol_methods {
    apr_status_t (*close)(ap_iol *fd);
    apr_status_t (*write)(ap_iol *fd, const char *buf, apr_size_t len,
                          apr_ssize_t *nbytes);
    apr_status_t (*writev)(ap_iol *fd, const struct iovec *vec, apr_size_t nvec,
                           apr_ssize_t *nbytes);
    apr_status_t (*read)(ap_iol *fd, char *buf, apr_size_t len,
                         apr_ssize_t *nbytes);
    apr_status_t (*setopt)(ap_iol *fd, ap_iol_option opt, const void *value);
    apr_status_t (*getopt)(ap_iol *fd, ap_iol_option opt, void *value);
    apr_status_t (*sendfile)(ap_iol *fd, apr_file_t * file, apr_hdtr_t * hdtr, 
                             apr_off_t * offset, apr_size_t * len, 
                             apr_int32_t flags);
    apr_status_t (*shutdown)(ap_iol *fd, int how);
    /* TODO: accept, connect, ... */
};
    
Look familiar? :-)

> Here are some examples of NALs that are either interesting
> to us or just fictitious:
> 
> - NetBIOS
> - unix domain sockets (with authentication)
> - NT named Pipes (those are like Unix domain sockets and
>   include authentication already, while the unix side needs
>   to add that)
> - The latter two can be wrapped in a single "IPC" facility.
> 
> Now some fictitious ones:
> - IPX using raw sockets
> - NetBEUI using raw sockets

All very good examples!

>...
> I haven't looked at pools right now, so if you don't feel
> like answering, a simple "RTFM" is okay.

Ask away. You could call us "human doc servers" :-)

> BTW: Do the things in the pool have reference counts? I
> mean, is the pool itself offering refcounting?

No reference counting.

> And I've also seen that the included objects come
> with a destroy-function, so one can have pools in pools,
> right?

Yes. For example, we have a connection pool that is created when a
connection first arrives. Various connection-specific stuff is placed in
there. When we begin request processing, a request pool is created as a
child of the connection pool. In many cases during the request processing,
we'll create further child pools of the request pool.

Now, let's say the request is done processing. We just toss the request pool
and everything done during that request is thrown out. No worries throughout
the code about free'ing things (given the comments about "talloc", it sounds
like you guys already know the benefits :-)

On the other hand, let's say the connection drops and we just abort the
request processing. The server just tosses the connection pool, which throws
out the connection-specific data, its (child) request pool, etc.

Even the connection pool is a child of another. The pools can be described
as a tree of pools, with a single global pool created at startup.
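
As a tiny illustration of that tree (error checking omitted):

  apr_pool_t *global, *conn, *req;

  apr_pool_create(&global, NULL);   /* the one global pool, made at startup */
  apr_pool_create(&conn, global);   /* per-connection pool */
  apr_pool_create(&req, conn);      /* per-request pool, child of conn */

  /* ... apr_palloc(req, n) freely during the request; never free() ... */

  apr_pool_destroy(req);    /* request finished: all of its allocations go */
  apr_pool_destroy(conn);   /* connection gone: this alone would also have
                               taken the child request pool with it */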

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/
