Mailing-List: contact apreq-dev-help@httpd.apache.org; run by ezmlm
Precedence: bulk
Received-SPF: pass (hermes.apache.org: domain of mawic@gmx.de designates
 213.165.64.20 as permitted sender)
Message-ID: <4165C338.8010203@gmx.de>
Date: Fri, 08 Oct 2004 00:29:12 +0200
From: Markus Wichitill <mawic@gmx.de>
User-Agent: Mozilla Thunderbird 0.8 (Windows/20040913)
MIME-Version: 1.0
To: Boris Zentner <bzm@2bz.de>
CC: apreq-dev@httpd.apache.org
Subject: Re: Apache::Request, APR::Table and UTF8
References: <455F1FE1-1139-11D9-A745-000D9331B488@2bz.de>
 <87is9p16ht.fsf@gemini.sunstarsys.com>
 <6E61FA29-16EC-11D9-BC00-000A95B9602E@kineticode.com>
 <95D96DBE-171F-11D9-9D34-000D9331B488@2bz.de> <416326AF.8040502@stason.org>
 <46F5EF7F-1725-11D9-B147-000A95B9602E@kineticode.com>
 <4163346D.8000501@stason.org>
 <10D4B4DB-172C-11D9-B147-000A95B9602E@kineticode.com>
 <41633F79.6080003@stason.org> <87k6u4x7c0.fsf@gemini.sunstarsys.com>
 <4164BC19.4030802@stason.org>
 <48369F3D-187D-11D9-A063-000A95B9602E@kineticode.com>
 <41658ADE.6080407@gmx.de> <41658FE1.5050502@modperlcookbook.org>
 <D72EFAA1-189F-11D9-9D34-000D9331B488@2bz.de>
In-Reply-To: <D72EFAA1-189F-11D9-9D34-000D9331B488@2bz.de>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Boris Zentner wrote:
>>> A simple form of UTF-8 support in Apache::Request that I wouldn't mind
>>> would be a flag "DECODE_UTF8 => 1", that when passed to new() would
>>> cause Apache::Request to call utf8::decode on the returned string every
>>> time param(), body() etc. is called, but would leave the original
> 
> This is not possible. The reason is that if you call decode for every 
> parameter, *ALL* parameters must be in utf8. That is not true.

As I said, I know this simple all-or-nothing approach is not the solution 
you want, but it's certainly "possible" and a solution that would be enough 
for most applications. I don't know what kind of mixed input your 
application receives in a single request, but in the common case of a 
browser-submitted form, it's either all UTF-8 or not, depending on the 
encoding of the HTML page (or maybe the accept-charset attribute, if any 
browsers support that).
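
A minimal sketch of the all-or-nothing idea, so we are talking about the
same thing. It uses a plain hash in place of the request object, and the
helper name decode_all_params and the sample values are made up for
illustration; nothing like this exists in Apache::Request today.

```perl
use strict;
use warnings;

# Hypothetical helper: decode every value of a parameter hash as UTF-8.
# Returns a new hash; the raw values are not modified.
sub decode_all_params {
    my ($raw) = @_;
    my %decoded;
    for my $name (keys %$raw) {
        my $value = $raw->{$name};  # copy, so the raw octets stay intact
        utf8::decode($value);       # in-place; invalid UTF-8 is left as-is
        $decoded{$name} = $value;
    }
    return \%decoded;
}

# Made-up sample input: "city" holds UTF-8 octets (u-umlaut as C3 BC).
my $raw    = { city => "M\xC3\xBCnchen", id => "42" };
my $params = decode_all_params($raw);

printf "%d octets, %d characters\n",
    length($raw->{city}), length($params->{city});
# prints "8 octets, 7 characters"
```

If a value is not valid UTF-8, utf8::decode returns false and leaves it
alone, so an application that cares could check the return value.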

My stance here is that your application with mixed input is rather specific 
if not unusual, and therefore needs to do its own handling of the issue, 
even if that means you have to do more wrapping/subclassing and have to 
educate your co-developers about how to use the resulting interfaces. Which 
you already did.

A lightweight library like apreq should not try to handle every possible 
format under the sun; that's why I've already cautioned against linking in 
full XML support.

> Also I can think of the DECODE_UTF8 flag from your example as a 
> utf8-flag-on-for-all parameters in the table.

Technically utf8::decode doesn't set the flag for 7-bit strings.
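
This is easy to check with utf8::is_utf8, which reports the internal
flag. A small sketch, with sample strings made up for illustration:

```perl
use strict;
use warnings;

# Sample inputs (made up for illustration).
my $ascii  = "hello";                 # 7-bit characters only
my $umlaut = "gr\xC3\xBC\xC3\x9Fe";   # UTF-8 octets: u-umlaut, sharp s

utf8::decode($ascii);
utf8::decode($umlaut);

printf "ascii:  flag %s\n", utf8::is_utf8($ascii)  ? "on" : "off";
printf "umlaut: flag %s\n", utf8::is_utf8($umlaut) ? "on" : "off";
# prints:
# ascii:  flag off
# umlaut: flag on
```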

> I understand that APR::Table can not change its current behavior. But 
> Apache::Request can and should. I'm really frightened that so many 
> emails and examples do not convert you all on how important correct data 
> for perl is.

We seem to differ mainly about whose responsibility it is to handle the 
UTF-8 issue: the developer's, who has all the relevant info, or that of a 
thin XS layer over Apache internals, which knows nothing about the 
context of the incoming data.

BTW, I think you make things look a bit too simple in your first post by 
hiding much of the real complexity behind $something and $something_else. 
I'm still not sure if you expect Apache::Request to do heuristic scanning of 
the input to automatically determine the encoding (which would be unreliable 
at best)?
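
To illustrate why I'd call it unreliable, here is a sketch of the
simplest heuristic I can think of: treat the input as UTF-8 if it happens
to be valid UTF-8. The function name looks_like_utf8 is hypothetical, and
this is only my guess at what such scanning might look like.

```perl
use strict;
use warnings;

# Hypothetical heuristic: call the input UTF-8 if it is valid UTF-8.
sub looks_like_utf8 {
    my ($octets) = @_;   # work on a copy; utf8::decode modifies in place
    return utf8::decode($octets) ? 1 : 0;
}

my $latin1    = "\xFC";      # u-umlaut in ISO-8859-1, invalid as UTF-8
my $ambiguous = "\xC3\xBC";  # u-umlaut in UTF-8, but equally valid as
                             # ISO-8859-1: A-tilde plus one-quarter sign

printf "%s\n", looks_like_utf8($latin1)    ? "utf8" : "not utf8";
printf "%s\n", looks_like_utf8($ambiguous) ? "utf8" : "not utf8";
# prints:
# not utf8
# utf8
```

The second result is only a guess: the same two octets are legitimate
ISO-8859-1 text, and nothing in the octets themselves says which one the
sender meant.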

> Again, no conversion, just getting back what I put in: that is the 
> minimum requirement for any data-store.

Stop thinking of those tables as data stores; they're APIs to a webserver 
that doesn't really support Unicode. Of course I might be biased, since I've 
always treated Apache::Request as a read-only object anyway.