From: Markus Wichitill
Date: Thu, 07 Oct 2004 20:28:46 +0200
To: apreq-dev@httpd.apache.org
Subject: Re: Apache::Request, APR::Table and UTF8
Message-ID: <41658ADE.6080407@gmx.de>
In-Reply-To: <48369F3D-187D-11D9-A063-000A95B9602E@kineticode.com>

David Wheeler wrote:
> Yes, if APR::Table is designed specifically to support US-ASCII only,
> then it seem pointless to add this support. I mean, even if the utf8
> flag could be maintained, if the underlying Apache C code doesn't
> properly handle anything over the 256th byte because it doesn't support
> it (and so, theoretically, what it does with bytes over 256 is
> unpredictable),

Make sure to patent bytes > 256, sounds like there could be a lot of money in those. SCNR ;)

Nah, those bytes would be treated the same as normal 8-bit characters (which aren't allowed in headers either), that's the whole point of UTF-8. If Apache is doing any length checks, it would count the length in bytes rather than in logical characters, but it wouldn't asplode or anything.

But I think that neither APR::Table nor Apache::Request needs UTF-8 support; it's easy enough to wrap/subclass those functions and do exactly what you want in those wrappers. Just don't stuff your own strings into APR::Tables: they're part of the Apache API, which doesn't directly support UTF-8. They're not a Perl utility class for general use, which indeed should support storing any special flags.

Some people might want to use utf8::decode on the input, others are afraid of utf8::decode's experimental status and prefer Encode::decode_utf8, or are crazy enough to call the faster Encode::_utf8_on (which might crash the server or send it into an infinite loop if the data wasn't valid UTF-8).
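To make the byte-versus-character point and two of the decoding routes concrete, here is a minimal sketch. The sample strings and variable names are made up for illustration; only utf8::decode and Encode::decode (with the FB_CROAK check) are real calls, and nothing here touches APR::Table or Apache::Request.

```perl
use strict;
use warnings;
use Encode ();

# Raw octets as they might arrive in a request: "für" encoded as UTF-8.
# (Illustrative value, not taken from the thread.)
my $octets = "f\xC3\xBCr";

# Before decoding, length() counts bytes, which is how a byte-oriented
# C API would see the value.
my $byte_len = length($octets);

# Route 1: utf8::decode works in place and returns false on malformed input.
my $in_place = $octets;
my $ok       = utf8::decode($in_place);
my $char_len = length($in_place);

# Route 2: Encode::decode with FB_CROAK dies on malformed data. Work on a
# copy, because a CHECK argument may modify the source string.
my $copy   = $octets;
my $strict = Encode::decode('UTF-8', $copy, Encode::FB_CROAK);

# Malformed input: a lone 0xFC byte is not valid UTF-8, so utf8::decode
# reports failure instead of producing a character string.
my $bad    = "f\xFCr";
my $bad_ok = utf8::decode($bad);

printf "bytes=%d chars=%d same=%d bad_ok=%d\n",
    $byte_len, $char_len,
    ($in_place eq $strict ? 1 : 0),
    ($bad_ok ? 1 : 0);
```

Both routes yield the same character string for valid input; they differ in how they report invalid input (a false return value versus an exception), which is one reason a single built-in policy is hard to pick.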
Others again might want to use Unicode::Normalize::NFC on all parameter strings after decoding. Too many options here for a one-size-fits-all approach. Unicode is not simple.

The rather vague "$something" variables in the thread starter post even sound like something like Encode::Guess might be needed, which is even further outside the responsibility of Apache::Request. The developer and the specific application have to know the encoding; heuristic scanning in Apache::Request would be slow and error-prone.

Regarding performance, using wrapper functions and/or storing your parameters in a Perl hash might have a tiny performance hit, but "tiny" is the keyword here, because who really uses more than half a dozen parameters most of the time? I mainly use Apache::Request for the better performance and lower memory use when handling multipart/form-data (MFD) uploads, not because of a nanosecond more or less when parsing a query string.

On a related note, I was hoping for UTF-8 support in DBI, because I thought calling utf8::decode for hundreds of fields in hashes returned by DBI might be more of a performance issue. But I've experimented with that since then, and even calling utf8::decode for no less than ~700 fields only reduced the overall performance from around 25 req/s (with all text being treated as 8-bit) to 24 req/s (with all UTF-8 processing enabled) in my test case, and that also includes the total overhead of UTF-8 in Perl, in a forum application that does lots of regex operations on those Unicode strings. Pretty impressive I think.

A simple form of UTF-8 support in Apache::Request that I wouldn't mind would be a flag "DECODE_UTF8 => 1" that, when passed to new(), would cause Apache::Request to call utf8::decode on the returned string every time param(), body() etc. is called, but would leave the original data/tables alone. That might be convenient when writing small handlers that don't warrant a library with wrapper functions etc. But I don't think that's what Boris wants.
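A rough sketch of the decode-on-read idea, done as a wrapper rather than inside Apache::Request itself. Fake::Request and My::DecodedParams are hypothetical names invented for this example; Fake::Request only mimics a param($name) method so the code runs without mod_perl, and it is not the Apache::Request API.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a request object. It only mimics a
# param($name) method returning raw octets.
package Fake::Request;
sub new   { my ($class, %p) = @_; return bless { params => {%p} }, $class }
sub param { my ($self, $name) = @_; return $self->{params}{$name} }

# Hypothetical wrapper: decodes values on the way out and never modifies
# the wrapped object's own storage.
package My::DecodedParams;
sub new {
    my ($class, $req, %opt) = @_;
    return bless { req => $req, decode => $opt{DECODE_UTF8} ? 1 : 0 }, $class;
}
sub param {
    my ($self, $name) = @_;
    my $value = $self->{req}->param($name);   # copy of the raw octets
    return $value unless $self->{decode} && defined $value;
    # Decodes the copy in place; on malformed input the copy is left
    # as raw octets and returned unchanged.
    utf8::decode($value);
    return $value;
}

package main;

my $req     = Fake::Request->new(q => "caf\xC3\xA9");   # "café" as UTF-8 octets
my $decoded = My::DecodedParams->new($req, DECODE_UTF8 => 1);

my $chars = $decoded->param('q');   # character string
my $raw   = $req->param('q');       # still the original octets

printf "decoded=%d raw=%d\n", length($chars), length($raw);
```

Because decoding happens on a copy at read time, the underlying table keeps plain octets, which matches the "leave the original data/tables alone" requirement. An application that wants Encode::decode_utf8 or NFC normalization instead could swap that one call inside param().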