httpd-apreq-dev mailing list archives

From Markus Wichitill <ma...@gmx.de>
Subject Re: Apache::Request, APR::Table and UTF8
Date Thu, 07 Oct 2004 18:28:46 GMT
David Wheeler wrote:
> Yes, if APR::Table is designed specifically to support US-ASCII only, 
> then it seem pointless to add this support. I mean, even if the utf8 
> flag could be maintained, if the underlying Apache C code doesn't 
> properly handle anything over the 256th byte because it doesn't support 
> it (and so, theoretically, what it does with bytes over 256 is 
> unpredictable), 

Make sure to patent bytes > 256, sounds like there could be a lot of money 
in those. SCNR ;)

Nah, those bytes would be treated the same as normal 8-bit characters (which 
aren't allowed in headers either); that's the whole point of UTF-8. If 
Apache does any length checks, it would count bytes rather than logical 
characters, but it wouldn't asplode or anything.
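
To make the bytes-versus-characters point concrete, here is a minimal sketch 
using only core Perl modules. The sample string and variable names are made 
up for illustration; it only shows that UTF-8 text is a sequence of ordinary 
octets and that a byte-oriented length check sees more units than there are 
logical characters.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "Gruesse" with u-umlaut and sharp s: 5 logical characters (sample data).
my $chars = "Gr\x{FC}\x{DF}e";

# The same text as UTF-8 octets, which is all a byte-oriented C API sees.
my $bytes = encode('UTF-8', $chars);

my $char_len = length $chars;   # counts characters
my $byte_len = length $bytes;   # counts octets

# Every octet is an ordinary 8-bit value; find the largest one.
my $max_octet = 0;
for my $octet (unpack 'C*', $bytes) {
    $max_octet = $octet if $octet > $max_octet;
}

print "characters: $char_len, bytes: $byte_len, max octet: $max_octet\n";
```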

But I think that neither APR::Table nor Apache::Request needs UTF-8 support; 
it's easy enough to wrap/subclass those functions and do exactly what you 
want in those wrappers.

Just don't stuff your own strings into APR::Tables, they're part of the 
Apache API, which doesn't directly support UTF-8. They're not a Perl utility 
class for general use, which indeed should support storing any special flags.

Some people might want to use utf8::decode on the input, others are afraid 
of utf8::decode's experimental status and prefer Encode::decode_utf8, or are 
crazy enough to call the faster Encode::_utf8_on (which might crash the 
server or send it into an infinite loop if the data wasn't valid UTF-8). 
Others again might want to use Unicode::Normalize::NFC on all parameter 
strings after decoding. Too many options here for a one-size-fits-all 
approach. Unicode is not simple.
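
A rough sketch of how some of these options differ, using only core modules. 
The sample octet strings are invented, and the strict 'UTF-8' encoding name 
plus the FB_CROAK check are just one possible choice. Encode::_utf8_on is 
deliberately left out because it does no validation at all.

```perl
use strict;
use warnings;
use Encode ();
use Unicode::Normalize qw(NFC);

# Octets for "e" + COMBINING ACUTE ACCENT, "t", precomposed e-acute (sample data).
my $raw = "e\xCC\x81t\xC3\xA9";

# Option 1: utf8::decode works in place and returns false on malformed input.
my $via_utf8 = $raw;
my $ok = utf8::decode($via_utf8);

# Option 2: Encode::decode with FB_CROAK dies on malformed input instead.
my $via_encode = Encode::decode('UTF-8', $raw,
    Encode::FB_CROAK() | Encode::LEAVE_SRC());

# Optional extra step: NFC composes "e" + accent into a single character.
my $nfc = NFC($via_encode);

# Malformed input: 0xC3 must be followed by a continuation octet, "(" is not one.
my $bad      = "caf\xC3\x28";
my $bad_copy = $bad;
my $bad_ok   = utf8::decode($bad_copy);   # false, $bad_copy is left as octets
my $croaked  = !eval {
    Encode::decode('UTF-8', $bad, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
```

Which of these an application wants (silent failure, an exception, 
normalization or not) is exactly the kind of decision that differs from one 
project to the next.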

The rather vague "$something" variables in the thread starter's post even 
sound as if something like Encode::Guess might be needed, which is even 
further outside the responsibility of Apache::Request. The developer and the 
specific application have to know the encoding; heuristic scanning in 
Apache::Request would be slow and error-prone.

Regarding performance, using wrapper functions and/or storing your 
parameters in a Perl hash might have a tiny performance hit, but "tiny" is 
the keyword here, because who really uses more than half a dozen parameters 
most of the time? I mainly use Apache::Request for the better performance 
and lower memory use when handling multipart/form-data (MFD) uploads, not 
because of a nanosecond more or less when parsing a query string.
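
A minimal sketch of the wrapper-plus-hash idea. My::FakeRequest and 
decoded_params() are hypothetical names, and the fake object only assumes a 
CGI.pm-style param() that returns the parameter names when called without 
arguments; it stands in for the real request object so the example runs on 
its own.

```perl
use strict;
use warnings;

# Stand-in for a request object, only so the sketch runs on its own.
package My::FakeRequest;
sub new { my ($class, %p) = @_; return bless { params => {%p} }, $class }
sub param {
    my ($self, $name) = @_;
    return keys %{ $self->{params} } unless defined $name;   # no argument: names
    return $self->{params}{$name};
}

package main;

# Hypothetical helper: copy every parameter into a hash of decoded strings.
sub decoded_params {
    my ($req) = @_;
    my %out;
    for my $name ($req->param) {
        my $value = $req->param($name);   # a copy; the request's data is untouched
        next unless defined $value;
        utf8::decode($value);             # leaves $value as-is if not valid UTF-8
        $out{$name} = $value;
    }
    return \%out;
}

my $req = My::FakeRequest->new(first => "Bj\xC3\xB6rn", last => "Smith");
my $p   = decoded_params($req);
print length($p->{first}), " characters\n";
```

The decoding happens once per request, so later lookups are plain hash 
accesses.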

On a related note, I was hoping for UTF-8 support in DBI, because I thought 
calling utf8::decode for hundreds of fields in hashes returned by DBI might 
be more of a performance issue. But I've experimented with that since 
then, and even calling utf8::decode for no less than ~700 fields only 
reduced the overall performance from around 25 req/s (with all text being 
treated as 8-bit) to 24 req/s (with all UTF-8 processing enabled) in my 
test case, and that also includes the total overhead of UTF-8 in Perl, in a 
forum application that does lots of regex operations on those Unicode 
strings. Pretty impressive I think.

A simple form of UTF-8 support in Apache::Request that I wouldn't mind would 
be a flag "DECODE_UTF8 => 1" that, when passed to new(), would cause 
Apache::Request to call utf8::decode on the returned string every time 
param(), body() etc. are called, but would leave the original data/tables 
alone. That might be convenient when writing small handlers that don't 
warrant a library with wrapper functions etc. But I don't think that's what 
Boris wants.
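
Purely as an illustration of the semantics I mean, not of any existing 
Apache::Request feature: the class names below are hypothetical, and 
My::FakeRequest again stands in for the real request object. Decoding 
happens on the returned copy only, so the underlying data stays as octets.

```perl
use strict;
use warnings;

# Stand-in for the underlying request object, only so the sketch is runnable.
package My::FakeRequest;
sub new   { my ($class, %p) = @_; return bless { params => {%p} }, $class }
sub param { my ($self, $name) = @_; return $self->{params}{$name} }

# Hypothetical wrapper honouring a DECODE_UTF8 option (not a real apreq flag).
package My::DecodingRequest;
sub new {
    my ($class, $inner, %opt) = @_;
    return bless { inner => $inner, decode => $opt{DECODE_UTF8} ? 1 : 0 }, $class;
}
sub param {
    my ($self, $name) = @_;
    my $value = $self->{inner}->param($name);   # a copy of the stored octets
    utf8::decode($value) if $self->{decode} && defined $value;
    return $value;
}

package main;

my $inner   = My::FakeRequest->new(city => "K\xC3\xB6ln");
my $decoded = My::DecodingRequest->new($inner, DECODE_UTF8 => 1);
my $plain   = My::DecodingRequest->new($inner);

my $city_chars  = $decoded->param('city');   # decoded on the way out
my $city_octets = $plain->param('city');     # flag off: raw octets
```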
