httpd-dev mailing list archives

From "Paul J. Reder" <>
Subject Re: Mod_include design
Date Fri, 03 Nov 2000 17:08:14 GMT
Wow, I'm a little overwhelmed here. After a dearth of comments or input, I took
my eyes off my mail for half a day and got flooded. Please allow me to
apologize up front for the length of this post, and let me try to catch
up and comment on the topics brought up...

The listed states in the first post were simply my initial take on states.
They acted as a place holder in the design for future parsing states. As
for whether tag state is required: my feeling is that it is safer and more
accurate to check a state variable than to infer state from a pointer being
NULL/non-NULL. I also feel that keeping state during the tag parsing is much
more efficient. Parsing the tag involves more than doing
5 byte comparisons. Consider the following pathological case:

"<", "!", "-", "-", "#" each in their own brigades.

Ryan suggested that I just squirrel away the buckets into a set-aside
brigade in the ctx, then restart the tag checking from the beginning each
time mod_include is called.

This means that for each set-aside bucket I re-iterate not just through a couple
of byte compares, but also through the code required to read from the buckets
and track compared positions. This isn't a lot of overhead, but it is more
than just 5 byte compares, and it is easily avoided.

By the time I am done there are likely to be a number of other states used
during the directive parsing. These additional states will never be visible
across invocations of mod_include because they are only entered once the
full tag has been found. These states could be stored locally, but I feel
keeping all of the state information in one place makes more sense.

In summary: The number of states will grow to include parsing states. Tag state
is useful for code readability (state == X vs. brgd_ptr != NULL) and for 
efficiency. And for crying out loud, the state variables are just an int and
a few pointers.

Max Buffer Size:
In 1.3, Apache allowed each token to be at most HUGE_STRING_LEN (8192 bytes).
The code did not read the whole directive in; it read bytes from what it
viewed as a file stream. I would guess that most tags are less than
1KB in total length, with some going longer.

My feeling is to have a default buffer of, say, 512 or 1024 bytes, then allocate
a larger one if required (see the next section about copying). This would be a
VERY local buffer, existing only for the life of the send_parsed_content
function. It could be allocated and freed safely within that function, or
allocated from the request pool and cleaned up automagically later. Personally,
I like the local alloc and free better.

This will probably send Ryan into convulsions ;), but after studying the code
and having conversations with a couple of people and now having received 
unsolicited agreement from Greg, I feel that making a copy once is the only
reasonable way to go.

Granted, Ryan is correct that the bucket handling code is contained in a couple
of functions. The problem is that these functions get called frequently and that
they already do a copy anyway; not just a copy, but a pool alloc and copy
for the values. So I don't think there can be any debate about copying: it is
currently happening, and there is no way to remove it completely. It would
be impossible, for example, to set a variable with a value
that spans buckets.

What I suggest is to make one copy. There is still debate about whether to copy
as you go or copy once you have the whole thing. My feeling is that copying once
you have the whole thing allows you to determine how much space is required so
you can either use the local buffer or allocate a bigger one. This is an important
point since we really have no idea how big the thing can get, and we certainly
have no idea how big it is until we find the end.

I also advocate copying only the directive content (not the start and end
delimiters). This provides a minor code simplification. The old code used the
end delimiter to mark the end of the stream for parsing purposes. Since we
already know where the end of the tag is, there is no sense complicating the
parsing code by looking for the ending sequence again.

I also advocate not ever copying any part of the directive again. The tolower'ing
can be done in place. The values can be referenced in place. Every token has
a clear delimiter (white space, "=", "\0", etc.). These delimiters can be converted to 
"\0" and pointers set to reference the bytes in place. During handle_set, for example,
the value gets apr_pstrdup'ed during the apr_table_setn call anyway.

So in summary, once the full tag has been obtained (and the directive bytes counted)
a buffer may be allocated and the directive bytes copied from the buckets into the
buffer. The parser will return pointers into this buffer after having marked the
end of the token with a '\0' (tolower'ing as required). If the value must exist
outside of this buffer (e.g. for a set) then the value can be dup'ed as required.
One optimization would be to check if the whole directive is contained in one bucket
and just use that buffer instead of copying at all.

I am working on a complete documented grammar for this. I will post it when it
is mostly complete. This post is already long enough so I will spare you all for now.

Paul J. Reder
"The strength of the Constitution lies entirely in the determination of each
citizen to defend it.  Only if every single citizen feels duty bound to do
his share in this defense are the constitutional rights secure."
-- Albert Einstein
