Date: Thu, 07 Oct 2004 20:28:46 +0200
From: Markus Wichitill <mawic@gmx.de>
To: apreq-dev@httpd.apache.org
Subject: Re: Apache::Request, APR::Table and UTF8

David Wheeler wrote:
> Yes, if APR::Table is designed specifically to support US-ASCII only, 
> then it seems pointless to add this support. I mean, even if the utf8 
> flag could be maintained, if the underlying Apache C code doesn't 
> properly handle anything over the 256th byte because it doesn't support 
> it (and so, theoretically, what it does with bytes over 256 is 
> unpredictable), 

Make sure to patent bytes > 256, sounds like there could be a lot of money 
in those. SCNR ;)

Nah, a UTF-8 encoded character is just a sequence of ordinary bytes, and 
Apache would treat those the same as any other 8-bit characters (which 
aren't allowed in headers either); that's the whole point of UTF-8. If 
Apache is doing any length checks, it would count bytes rather than logical 
characters, but it wouldn't asplode or anything.
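
To make the byte-versus-character point concrete, here is a minimal Perl 
sketch. The sample string is made up for illustration; it only shows that the 
encoded form consists of ordinary bytes and that a byte-based length check 
sees more units than there are logical characters.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Sample text, chosen only for illustration: "Gruesse" with umlaut and
# sharp s, followed by a euro sign, as Perl characters.
my $chars = "Gr\x{FC}\x{DF}e \x{20AC}";

# What actually travels through Apache is the UTF-8 byte sequence.
my $bytes = encode_utf8($chars);

my $char_count = length $chars;   # logical characters
my $byte_count = length $bytes;   # what a byte-based length check would see

# Every element of the encoded string is an ordinary byte (0..255).
my $max_byte = 0;
for my $octet (unpack 'C*', $bytes) {
    $max_byte = $octet if $octet > $max_byte;
}

print "characters: $char_count, bytes: $byte_count, largest byte: $max_byte\n";
```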

But I think that neither APR::Table nor Apache::Request needs UTF-8 support; 
it's easy enough to wrap/subclass those functions and do exactly what you 
want in those wrappers.
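
A wrapper of that kind could look roughly like the sketch below. The 
My::MockRequest class and the decoded_param() name are invented for this 
example; the mock only mimics a param() method so the code runs without 
Apache, and a real handler would pass its Apache::Request object instead.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a request object; only param() is mimicked here.
{
    package My::MockRequest;
    sub new   { my ($class, %params) = @_; return bless { params => {%params} }, $class }
    sub param { my ($self, $name) = @_; return $self->{params}{$name} }
}

# Wrapper: fetch one parameter and return it as a decoded character string.
sub decoded_param {
    my ($req, $name) = @_;
    my $value = $req->param($name);
    return undef unless defined $value;
    # utf8::decode works in place on our copy and returns false for invalid
    # UTF-8; in that case the bytes are returned unchanged.
    utf8::decode($value);
    return $value;
}

my $req  = My::MockRequest->new(city => "Z\xC3\xBCrich");   # raw UTF-8 bytes
my $city = decoded_param($req, 'city');

printf "raw length: %d, decoded length: %d\n",
    length($req->param('city')), length($city);
```

The decoding policy lives in one place, so swapping utf8::decode for something 
stricter only touches the wrapper.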

Just don't stuff your own strings into APR::Tables; they're part of the 
Apache API, which doesn't directly support UTF-8. They're not a Perl utility 
class for general use, which would indeed have to preserve special flags 
such as the UTF-8 flag.

Some people might want to use utf8::decode on the input, others are afraid 
of utf8::decode's experimental status and prefer Encode::decode_utf8, or are 
crazy enough to call the faster Encode::_utf8_on (which might crash the 
server or send it into an infinite loop if the data wasn't valid UTF-8). 
Others again might want to use Unicode::Normalize::NFC on all parameter 
strings after decoding. Too many options here for a one-size-fits-all 
approach. Unicode is not simple.
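
As a rough illustration of how these options differ, the following sketch 
runs the same made-up input through utf8::decode, Encode::decode_utf8 and 
Unicode::Normalize::NFC, and also shows what utf8::decode does with invalid 
input. Encode::_utf8_on is deliberately left out.

```perl
use strict;
use warnings;
use Encode ();
use Unicode::Normalize qw(NFC);

# Illustrative input: "e" followed by U+0301 (combining acute), as UTF-8 bytes.
my $raw = "e\xCC\x81";

# Option 1: core utf8::decode, in place, returns false for invalid input.
my $via_core = $raw;
my $core_ok  = utf8::decode($via_core);

# Option 2: Encode::decode_utf8, returns a new string (given a copy to be safe).
my $copy       = $raw;
my $via_encode = Encode::decode_utf8($copy);

# Optional extra step: normalize to NFC so "e" + accent becomes one character.
my $normalized = NFC($via_core);

# Invalid input: the byte 0xFF never occurs in UTF-8, so utf8::decode reports
# failure and leaves the bytes as they were.
my $bad    = "\xFF";
my $bad_ok = utf8::decode($bad);

printf "decoded: %d chars, NFC: %d chars, invalid input accepted: %s\n",
    length($via_core), length($normalized), ($bad_ok ? 'yes' : 'no');
```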

The rather vague "$something" variables in the thread starter post even 
sound like something like Encode::Guess might be needed, which is even more 
outside of the responsibility of Apache::Request. The developer and the 
specific application have to know the encoding; heuristic scanning in 
Apache::Request would be slow and error-prone.
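
A small sketch of why guessing is error-prone, using Encode::Guess on made-up 
input: the same bytes are identified with the default suspects, but become 
ambiguous as soon as a single-byte encoding is added to the list.

```perl
use strict;
use warnings;
use Encode::Guess;   # exports guess_encoding()

# Illustrative bytes: "Zuerich" with a real umlaut, encoded as UTF-8.
my $bytes = "Z\xC3\xBCrich";

# With the default suspects, the guess succeeds and returns an encoding object.
my $guess = guess_encoding($bytes);

# Add Latin-1 as a suspect: the same bytes are also valid Latin-1, so the
# guess is ambiguous and a plain error string comes back instead of an object.
my $ambiguous = guess_encoding($bytes, 'latin1');

print "default suspects: ",
    (ref $guess ? $guess->name : "failed: $guess"), "\n";
print "with latin1:      ",
    (ref $ambiguous ? $ambiguous->name : "failed: $ambiguous"), "\n";
```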

Regarding performance, using wrapper functions and/or storing your 
parameters in a Perl hash might have a tiny performance hit, but "tiny" is 
the keyword here, because who really uses more than half a dozen parameters 
most of the time? I mainly use Apache::Request for the better performance 
and lower memory use when handling multipart/form-data (MFD) uploads, not 
because of a nanosecond more or less when parsing a query string.

On a related note, I was hoping for UTF-8 support in DBI, because I thought 
calling utf8::decode for hundreds of fields in hashes returned by DBI might 
be more of a performance issue. But I've experimented with that since 
then, and even calling utf8::decode for no fewer than ~700 fields only 
reduced the overall performance from around 25 req/s (with all text being 
treated as 8-bit) to 24 req/s (with all UTF-8 processing enabled) in my 
test case, and that also includes the total overhead of UTF-8 in Perl, in a 
forum application that does lots of regex operations on those Unicode 
strings. Pretty impressive I think.
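
For anyone who wants to try this on their own machine, here is a minimal 
timing sketch. It is not the test case described above: the field contents 
are invented, no DBI is involved, and it only measures utf8::decode over 700 
short strings, so the numbers it prints say nothing about req/s.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# 700 made-up fields containing UTF-8 bytes ("Strasse" with a sharp s).
my @fields = map { "Stra\xC3\x9Fe $_" } 1 .. 700;

my $start   = time;
my $decoded = 0;
for my $field (@fields) {
    # $field aliases the array element, so the array is decoded in place.
    $decoded++ if utf8::decode($field);
}
my $elapsed = time - $start;

printf "decoded %d fields in %.6f seconds\n", $decoded, $elapsed;
```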

A simple form of UTF-8 support in Apache::Request that I wouldn't mind would 
be a flag "DECODE_UTF8 => 1" which, when passed to new(), would cause 
Apache::Request to call utf8::decode on the returned string every time 
param(), body() etc. are called, but would leave the original data/tables 
alone. That might be convenient when writing small handlers that don't 
warrant a library with wrapper functions etc. But I don't think that's what 
Boris wants.
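
To show what I mean by such a flag, here is a sketch using a hypothetical 
My::Request class. Apache::Request has no DECODE_UTF8 option; the class, the 
"params" argument and the internal hash are assumptions made so that the 
example runs on its own.

```perl
use strict;
use warnings;

{
    # Hypothetical wrapper class; Apache::Request itself has no such flag.
    package My::Request;

    sub new {
        my ($class, %args) = @_;
        my $self = {
            decode => $args{DECODE_UTF8} ? 1 : 0,
            raw    => $args{params} || {},   # stand-in for the parsed request data
        };
        return bless $self, $class;
    }

    sub param {
        my ($self, $name) = @_;
        my $value = $self->{raw}{$name};     # a copy; the stored bytes stay untouched
        utf8::decode($value) if $self->{decode} && defined $value;
        return $value;
    }
}

my $req = My::Request->new(
    DECODE_UTF8 => 1,
    params      => { city => "Z\xC3\xBCrich" },   # raw UTF-8 bytes
);
my $city = $req->param('city');

printf "returned: %d chars, stored: %d bytes\n",
    length($city), length($req->{raw}{city});
```

Decoding happens on every call, which costs a little, but the underlying 
data is never modified.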