httpd-dev mailing list archives

From "Paul J. Reder"
Subject Re: filter design
Date Fri, 30 Jun 2000 19:45:35 GMT
Perhaps I'm way out of line here, someone can tell me to shut up
and I'll go back to my corner, but it seems to me that we have
gotten completely bogged down in minute details of an incomplete
picture. Or perhaps I just don't understand.

I feel like there is not a complete statement of the problem and,
perhaps related to this, there is no public design of a solution.

I would like to suggest that we all take a step back and start at
the beginning. Perhaps if we publicly discuss the problem in the
abstract a solution will become obvious, or at least deficiencies
of solutions will become obvious. As a novice in this I will start
basic and build.

If this is at all useful, please feel free to build on it or
correct it so that we are all looking at the complete problem
from the same perspective. This way we can all work towards a
solution without the side effects of "bathroom agreements".*

* ("Bathroom agreements" were what we used to call the clarifying
discussions that would happen during bathroom breaks between 
people participating in ISO standards meetings. If you didn't
know someone who was in the bathroom during one of these, you
couldn't understand the background or intention of the standard.)

As I understand it, filtering provides control from beginning to
end of the generation and processing of content. This starts with
input of the information required to create/influence content,
progresses through the manipulation/alteration/generation of
"existing" content, possibly translates the content, and finally,
transmits the content.

I see filters happening at three distinct stages in this process:

1) "Input" filters: These are filters charged with figuring out
   how to obtain content, and from what location. This might include
   filters that do things like read a .jpg file and convert it into
   a .gif file. It might also include obtaining content from the
   cache, or dealing with a proxy.

   Is this accurate? Are there other examples? Any complex scenarios
   to be considered?

2) "Manipulation" filters: This would be filters that manipulate
   existing content. This could be parsing the content to replace
   cgi or ssi directives with the generated content, or performing
   some sort of regular expression based gargling. In short,
   anything that works with content already on hand but does not 
   require all of the final content to be on hand.

   What is the list of possible -useful- filters here? Hypothetical
   filters to demonstrate some truly pathological manipulation aren't
   what I'm asking for here. 

3) "Output" filters: This would be things like compression, chunking,
   translation, encryption, etc. In short, anything that basically
   treats the content as a whole and performs some transmogrification
   without regard to any content specifics.

   Do any of these possibly feed back into other filters or are these
   all "final" filters? Are there other filters that fall into this
   category? What about storing into a cache?

To avoid religious arguments about buckets vs. char * vs. whatever,
I will use the generic term DATA. DATA just means some generic glom
of bytes.

Input filters create new DATA. To do this they may have to temporarily
create a copy of the DATA for internal processing, but the end result
is initial content DATA to be sent in response to a request. Input
filters, for the most part, should be self contained sources of DATA
and not care about any other filters.

Manipulation filters alter/expand/contract existing DATA. It is up
to each filter to determine how best to manipulate the DATA. This
might mean copying the DATA, but should not require it. It is possible
that there may be some interaction between filters at this level. 
For example, an SSI may include an exec of a cgi which may in turn
generate content which contains another SSI directive. This doesn't
imply that the filters know about each other, just that work done
in one may impact work required in another. Order probably shouldn't
be important here because recursive manipulation should be provided
for, meaning that all of the handlers get a shot, eventually.
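
A minimal sketch of the "everyone gets a shot, eventually" idea, in
Python. The directive syntax, the FILES and PROGRAMS tables, and the
function names are all invented for illustration, and DATA is treated
as a plain string rather than a stream. Each filter is applied in
turn, and the whole pass is repeated until nothing changes, so the
result should not depend on the order the filters are listed in.

```python
import re

# Hypothetical directive syntax, invented for this sketch only.
SSI = re.compile(r"<!--#include (\w+)-->")
CGI = re.compile(r"<!--#exec (\w+)-->")

# Hypothetical stand-ins for included files and for cgi output.
FILES = {"footer": "plain text"}
PROGRAMS = {"gen": "<!--#include footer-->"}


def ssi_filter(data):
    """Replace each include directive with the named file's content."""
    return SSI.sub(lambda m: FILES[m.group(1)], data)


def cgi_filter(data):
    """Replace each exec directive with the named program's output."""
    return CGI.sub(lambda m: PROGRAMS[m.group(1)], data)


def manipulate(data, filters, max_passes=10):
    """Run every filter repeatedly until a full pass changes nothing."""
    for _ in range(max_passes):
        before = data
        for f in filters:
            data = f(data)
        if data == before:
            return data
    # Guard against content that keeps generating new directives.
    raise RuntimeError("content did not settle")
```

Calling manipulate(page, [ssi_filter, cgi_filter]) and
manipulate(page, [cgi_filter, ssi_filter]) on the same page should
give the same result. The max_passes guard is an assumption of the
sketch: some limit is needed in case generated content never settles.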

Output filters translate existing DATA. These filters should be
able to operate on the DATA as a stream without requiring the full
content to be present all at once. In general these filters do
not care about the HTML semantics of the content (though perhaps
the language semantics, for charset translation). These filters
will almost certainly require that a copy of the DATA be made
(unless DATA transmogrification in place is possible). These
filters should not really have any interaction with other filters,
although order may be really important (translate->compress->encrypt).
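
To make the ordering point concrete, here is a rough Python sketch of
three output filters chained as generators, each consuming chunks of
DATA as they arrive. The function names and the run() helper are
assumptions of the sketch. The "scramble" step is only an XOR
placeholder standing in for encryption and is not secure.

```python
import codecs
import zlib


def translate(chunks, src="utf-8", dst="latin-1"):
    """Re-encode a byte stream chunk by chunk."""
    # The incremental decoder holds back a partial multi-byte
    # character until the rest of it arrives in a later chunk.
    decoder = codecs.getincrementaldecoder(src)()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text.encode(dst)
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail.encode(dst)


def compress(chunks):
    """Compress a byte stream without holding all of it at once."""
    comp = zlib.compressobj()
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:
            yield out
    yield comp.flush()


def scramble(chunks, key=0x5A):
    """Placeholder for an encryption filter: plain XOR, NOT secure."""
    for chunk in chunks:
        yield bytes(b ^ key for b in chunk)


def run(source, *filters):
    """Feed each filter's output into the next, then collect it."""
    stream = iter(source)
    for f in filters:
        stream = f(stream)
    return b"".join(stream)
```

Something like run(chunks, translate, compress, scramble) applies the
filters in the order given. Swapping compress and scramble would
still run, but compressing already scrambled bytes tends to gain
little, which is one reason the order matters.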

Some words about DATA:
During the Input filter processing, filters need to have access
to temporary DATA as well as be able to allocate longer lived
DATA to return their results in. Assuming that a request
may pull content from multiple Input filter sources,
DATA should be able to be allocated in separate chunks and passed
back without requiring aggregation or reallocation.

During the Manipulation filter processing, DATA should be able to 
be expanded (i.e. replace a directive with generated content) or
contracted (substitute a smaller environment variable value for its tag)
without requiring aggregation or reallocation of DATA. In other
words, the structure of DATA should not force copying to happen, but
certainly should allow copying to happen efficiently.
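
One way to picture this, as a sketch only: treat DATA as a list of
separate chunks and replace a directive by splitting just the chunk
that contains it. The replace_span() name and its arguments are made
up for this example. Python memoryview slices refer to the original
bytes instead of copying them, so the untouched chunks and the
untouched parts of the split chunk are not reallocated.

```python
def replace_span(chunks, index, start, end, replacement):
    """Replace chunks[index][start:end] in place within the list.

    Only the list entry at `index` is rewritten. It becomes up to
    three entries: a view of the bytes before the span, the
    replacement, and a view of the bytes after the span.
    """
    target = memoryview(chunks[index])
    pieces = [target[:start], memoryview(replacement), target[end:]]
    # Drop empty pieces so a pure deletion leaves no zero-length chunk.
    chunks[index:index + 1] = [p for p in pieces if len(p)]
    return chunks
```

The same call covers expansion (a long replacement), contraction (a
short one), and deletion (an empty one). Writing the result out is
then a matter of walking the list, for example with b"".join(chunks).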

During the Output filter processing, DATA should be used and freed.
Any copying should be from DATA to output sink or to temporary
storage and then output sink (assuming transform in transit). The
output sink of one filter may just be the input source of another.

Some requested HTML examples to think about:

1) Containing only text.

2) Containing 10 .gif or .jpg references (perhaps filtering
   from one format to the other).

3) Containing an exec of a cgi that generates a text only file.

4) Containing an exec of a cgi that generates an SSI of a text only file.

5) Containing an exec of a cgi that generates an SSI that execs a cgi
   that generates a text only file (that swallows a fly, I don't know why).

6) Containing an SSI that execs a cgi that generates an SSI that
   includes a text only file.

   NOTE: Solutions must be able to handle *both* 5 and 6. Order
         shouldn't matter.

7) Containing text that must be altered via a regular expression
   filter to change all occurrences of "rederpj" to "misguided".

8) Containing text that must be altered via a regular expression
   filter to change all occurrences of "rederpj" to "lost".

9) Containing perl or php that must be handed off for processing.

10) A page in ascii that needs to be converted to ebcdic, or from
    one code page to another.

11) Use the babelfish translation filter to translate text on a
    page from Spanish to Martian-Swahili.

12) Translate to Esperanto, compress, and encrypt the output from 
    a php program generated by a perl script called from a cgi exec
    embedded in a file included by an SSI  :)


How many of these are effectively duplicates? Which ones are
misguided or useless? What are some other real life type
examples that might shed light on certain filtering difficulties?

This is longer than I intended. Sorry about that. I hope you find
it useful, and perhaps a starting point for moving on from here.
I hope this beginner's perspective (mine not yours) helps get more
people thinking about and contributing to this design.

Thanks for your perusal time.

Paul J. Reder
"The strength of the Constitution lies entirely in the determination of each
citizen to defend it.  Only if every single citizen feels duty bound to do
his share in this defense are the constitutional rights secure."
-- Albert Einstein
