httpd-dev mailing list archives

From "William A. Rowe, Jr." <wr...@covalent.net>
Subject Re: [PATCH] mod_ssl input filtering...
Date Mon, 08 Oct 2001 04:03:50 GMT
From: "Greg Stein" <gstein@lyra.org>
Sent: Sunday, October 07, 2001 8:51 PM


> On Sat, Oct 06, 2001 at 12:06:22PM -0500, William A. Rowe, Jr. wrote:
> > From: "Greg Stein" <gstein@lyra.org>
> > Sent: Saturday, October 06, 2001 11:11 AM
> >...
> > > Exactly. So why would you suggest that code that must operate within the
> > > *filter* chain be moved to reside there??
> > 
> > If you suggest creating a setaside_filter.
> 
> I'm not suggesting that. If a filter has data that it cannot use or cannot
> return at the moment, then it can set that aside for the next invocation.
> But I see that as an internal feature of a few filters.

"cannot use _or_ cannot return" [emphasis mine] ... that's double work. All filters
then need to both set aside incomplete data (if they have something to pass on)
and set aside data the caller doesn't want.  That's twice as many tests - and
unnecessary duplication of code.

> > IF a module or filter MUST not contend
> > with more than (X) bytes, then _it_ can place the setaside_filter in the chain.
> > (ap_get_client_block could always do so.)  These filters will BE RARE AND FAR 
> > BETWEEN.  Take an HTML parser.  It HAS to do its own setaside regardless, because
> > it's the ONLY filter that understands these < > tags, and has to have complete
> > tags to operate on.
> 
> Well, I don't think the set aside will occur because it only receives "<im"
> from f->next. Instead, it will ask f->next for more data, thus forming "<img
> src='foo'>", which can be parsed.

If such a parser has formatted a bunch of data, it could choose either to grab
more data (potentially never flushing until the entire body was constructed, due
to such boundaries) or to pass on what it could process.  Either way, that's a
design decision we shouldn't dictate.

> set-aside only occurs when a filter retrieves or generates more data than
> the caller has asked for. gzip can easily end up doing this -- if the caller
> asks for 1k but gzip decodes 3k, then the filter will need to set aside 2k
> for a later call. If your HTML filter is, say, insert the image contents
> into the stream, then it might end up generating more content than the
> caller asked for, and will need to set aside the rest.

Right.  Duplicate code, duplicate effort.  Let one side or the other decide.

[My argument doesn't apply to gzip transfer encoding, only gzip content encoding.
gzip transfer encoding is part of the protocol stack anyway.]

> > 99% of the time, a given filter that must set-aside doesn't have any
> > precognition of where its setaside boundary will lie.
> 
> I'm not sure what this means. 

Sorry, I'm suggesting that most content decoding involves end tags, not length
prefixes.  Whether you are talking about html tags or multipart bodies, the
end can't be determined till such a tag is read from the body.

> Set aside occurs because a filter has more data than the caller asked for.

Only in a protocol or connection filter.  That's my entire argument, it should
_not_ be set aside by body filters.  Pass it all.

> The HTTP_IN filter defines the "end of request" boundary, and it enforces
> that by simply not requesting any data from f->next that would go past that
> boundary.

I'm not arguing about Protocol/Connection filters.  Of course your statement
makes sense for those filters.

> > We have two choices to deal with these;
> > 
> >   1. Push back the incomplete data
> > 
> >   2. Set aside our incomplete data
> 
> I can't address this; it seems to follow from your previous statement, which
> I don't understand :-)

Right.  A body filter looks at tags in the body.  Usually those are matched start
and end tags it must read.  If it gets a start tag without an end tag, then
the options I pointed out make sense.

> > Now the current argument is "Well, filters shouldn't get anything more than they
> > bargained for".  That's not a rule for output filters, it shouldn't become one for
> > input filters.
> 
> The two filter stacks are *dramatically* different. Entirely different rules
> apply.
> 
> * the handler *pushes* data down and through the output stack, until it
>   ultimately reaches the network.
> 
> * the handler *pulls* data up and through the input stack, ultimately from
>   the network

So?  The handler can *push* too much data at a filter.  So what?  The filter must
deal with it.  I'm arguing that input content (body) filters do the same.

Output filters must deal with the content they are dealt.  Input filters should
behave the same way, excepting the protocol, transfer encoding, crypto and
core.  These Connection/Protocol filters alone negotiate a fixed, known number of
bytes.  Body filters don't (and can't) have a clue until the transfer is completed
by HTTP_IN (or some other protocol module) sending back the EOS bucket.

> ap_pass_brigade() and ap_get_brigade() have entirely different semantics,
> and we need to apply entirely different rules about filter construction as a
> result of those semantics.
>
> > If an Input filter gets more than it can handle, it can;
> 
> Faulty premise. The input filter should never *ask* for more than it can
> handle. Therefore, it will never receive too much.

Of course it will, incomplete characters, tags or other segments.  And it won't
have any control of how much it gets.  That's how output (content) filters work,
so it's how input (content) filters should work as well.

> >...
> >   3. Never be passed too much data
> >...
> > I simply argue that all output filters are -required- to handle all data passed
> > to them.  There is little good reason to change this rule for input filters.
> 
> There are reasons to have different rules for input vs output filters.
> However, this *is* one that is common. An input filter must deal with all
> the input that it receives from f->next. The simple fact is: that it knows
> what it can handle and will never ask for more than it can handle.

And there are equally good reasons to have different rules for connection/protocol
filters and content/body filters.

The usual input (body) filter has no clue what's coming.  It should simply read
and parse.  Let the connection/protocol deal with the details.

> >...
> > > Handler authors simply fetch data. And our interface for that is
> > > ap_get_client_block() just like it has always been for them.
> > 
> > Right.  Doesn't change.  ap_get_client_block(), given 24kb of data, must give back
> > data to the caller one 8kb block at a time.  Another setaside, right there.  Let this
> > one function go to that effort.
> 
> More than this one function needs to define what f->next will give it.
> Specifically, the HTTP_IN filter needs to limit what it asks for from the
> next filter.

Yes, and all those functions live at the connection/protocol layer.  None of them
apply to content.

> If a handler asks ap_get_client_block() for 8k of data, then the function
> asks the input stack for 8k of data. It isn't going to get any more than
> that, so it won't have to do any setaside. (which is fortunate because it
> doesn't have f->ctx like filters, so it would need to use request_config
> crap to give itself some global context (and I think it does today, but that
> can go away))

It may have to set aside, if we eliminate the duplication of input _and_ output 
setaside rules for content (body) filters.

And it wouldn't become duplicate code.  Who knows how many times the body is
transformed?  This is the only point past HTTP_IN that needs to care about
'throttling'.

> >...
> > > Returning what you were told to return isn't "throttling."
> > 
> > These are brigades.  The caller _rarely_ knows how much it needs or what is ideal.
> > They are already read to memory.
> 
> Actually, stuff isn't already read into memory. We read from the socket when
> people need more data from the socket.

Yes, and once it's read from the socket, you expect every content/body filter to
set aside little bits, just because the filter above thought that 8k was a nice
round number???  Only to discover that the last input filter compresses that
data back down to 5k?  Foolish.

> > Can we shortcut this argument with the following logic?
> > 
> >   Connection (Protocol) filters
> >...
> >   Content (Body) filters
> >...
> 
> Input filters are input filters. There is no need to create different rules
> for them.

Of course there is, if you've been following this whole line of reasoning.
The only other sane option is to revert this entire redesign, and have HTTP_IN
as a connection filter, where it was working just fine before.

> >...
> >   They may
> >   setaside part of the content if they cannot parse it complete, to be sent on
> >   a future request when they have sufficient content to process.
> 
> Hmm. This comment makes me believe you have a mistaken view on input
> filters. The filter does not *send* data to a request [handler]. The handler
> *asks* for the data.

NO!  Really?

s/to be sent/to be passed on/

pedantic

> >...
> > > SSL: again, the caller is specifying the boundary, *not* SSL. The SSL filter
> > >      may ask f->next for an amount based on what is needed to decode a
> > >      packet (that is in the opposite direction from what you mention above).
> > 
> > No, it's bidirectional constraints.  CORE_IN knows what it can return right now,
> > SSL knows how much it needs to continue after a partial packet.
> 
> Bidirectional? No. CORE_IN is told what to return. It doesn't inherently
> know the "proper" value. It is too low level to have a grasp of what the
> caller determines to be proper.

CORE_IN knows what is already fetched from the wire (real bytes in memory).
Therefore, if the caller is willing to get everything, we will have a -1 method
of getting whatever's been read.

CORE_IN is constrained by HTTP_IN for how many bytes (line) it may reply with.  
But it's not constrained that way by SSL_IN, since that filter will live for the 
life of the connection, and will take whatever it can get, non-blocking.

SSL_IN knows initially nothing.  After it needs to complete a packet, it knows
how many bytes will complete that packet.  So it has one idea of what to ask
for, but is constrained by HTTP_IN in how much it will actually reply with.

> Simple exercise: who are *all* the callers of the CORE_IN filter? What are
> *all* of their "proper" values for their input?
>
> The obvious answer is: you cannot know [because it is a pluggable system].
> 
> Thus, CORE_IN must be *told* what to return. Thus, no bidirectional info.

Not by SSL, it doesn't.
 
> >...
> > > We ignore in both directions. The all-important piece of information is the
> > > filter asking f->next for a specific amount of data. The filter is the only
> > > person in the driver's seat; it is the only one who knows what the proper
> > > amount of data is for its particular needs, which won't screw up any
> > > potential protocol boundary.
> > 
> > Not all filters care.  The filter chain shouldn't be Constrained until we hit
> > the Connection (Protocol) end of the chain.  That's the beginning and end of
> > my entire argument.  You are raving that "WE MUST DO THIS BECAUSE OF CONNECTIONS",
> > so you are polluting the Content stack with bogosity from the Connection stack.
> > The server's existing Content stack model had polluted the Connection stack.
> 
> I'm not raving, and I'm talking about pollution. I am stating that input
> filters should be passed a bytes-to-read amount, and that they should never
> return more than that.

Pollution is requiring filters to set aside the data passed in _and_ the data passed out.

> What you are talking about is something I mentioned in a previous note:
> 
> -----------------------------------------------------------------
> On Sat, 6 Oct 2001 09:11:20 -0700, Greg Stein wrote:
> > The goal is simple: input filters should never return more than what the
> > caller asks for. There should be two modes of returning data: return one
> > line or return (at most) N bytes.
> > 
> > [ based on some discussion in this thread, there can/should also be a new
> >   mode which is "return whatever you think is a Good Amount [which may also
> >   be limited by protocol considerations]". but that work can come once we
> >   get the other stuff fixed. ]
> > 
> > Another goal is to fix the signature of the input filters. That *readbytes
> > thing is way wrong.
> -----------------------------------------------------------------

Not precisely.  I'm suggesting that the 'new mode' (original behavior) should be
the <ONLY> behavior respected by body/content input filters, since any magic
number they pass is bogus at that level.  It isn't till you hit the protocol
filter (HTTP_IN) that we actually start bytecounting.  If you are just looking
for a simple limit so you get nice, clean 8k boundaries, your content/body filter
should do that work itself.

> That is a second step because it is also preconditioned on getting the input
> filters fixed w.r.t Justin's fixes. Filters have not been respecting that
> parameter, but they need to. And the darn signature should be fixed along
> with a hard, formal definition of the values/combinations of the parameters.
> When we have *that*, then we have a platform for "give me a Good Amount".

You won't get SSL working correctly till we implement this.

> >...
> > > Because the guy asking for 1000 bytes knows that is the request boundary and
> > > does not want any more than that. The higher level filter is in control, so
> > > screw the lower filter and what it may want to return.
> > 
> > ONE pair of filters has that dialog.  Most filters don't, and shouldn't care.
> 
> Every filter must be able to respect the caller's request for a specified
> amount. The caller may have particular requirements over how much data can
> be returned.

Who?  Past the protocol, name one reasonable example please instead of painting
broad strokes.  I've tried very hard in this discussion to name specifics.

> I think your statement can be reformulated as: "Most filters should pass
> AP_MODE_GOOD_AMOUNT, providing f->next with as much flexibility as possible."

"Most filters shouldn't have to implement unnecessary limiting on behalf of another
filter that could do that work themselves, if they actually require it."

> [ that symbol name is for exposition purposes only; should be changed ]
> 
> Now *that*, I can agree with. And it gives you an input filter stack that is
> relatively free of constraints on sizes.

You are suggesting the optimal behavior.  I'm proposing we mandate this behavior
in content (body) filters.

> But again: *EVERY* filter must obey the max-to-return parameter if it has
> been provided.

No.  That's bogus.

> >...
> > > The HTTP_IN filter was a connection filter because of that return-data
> > > thing. It would get too much data from the filter below it (which it *knew*
> > > was a socket, btw; it was coded to never call back for more data), so it had
> > > to store that into f->ctx. HTTP request parsing is part of request handling,
> > > not the connection. By having it part of the connection, it means that you
> > > cannot Upgrade the connection to a different protocol. Today, the HTTP_IN
> > > filter is a request filter and it will never ask for more than the end of
> > > the request, thus leaving the "next request" down in the connection fliters.
> > 
> > Eliminating the entire argument over how HTTP_IN, SSL_IN and CORE_IN work,
> > I don't see how any of this discussion has anything to do with Connection
> > filters.  I will say it one last time.  The new model is irrelevant and adds
> > additional (unnecessary) code to Content filters.
> 
> Filters are filters. There is a particular signature for them, and a
> particular semantic for their parameters. All filters should obey those
> semantics. That includes the possibility of limiting what can be returned. A
> filter simply cannot know what requirements its caller may have, so it
> cannot violate its request. Preferably, the callers will give f->next
> flexibility.

Then input and output filters should be defined identically.  "Oh, no, input and
output filters are designed differently for these very good reasons..."  And so?
Input connection/protocol filters are different beasts from content/body filters,
and they too need different rules, or we need to go back to the old, working and
workable model.

Bill

