perl-modperl mailing list archives

From André Warnier
Subject Re: Charset in response
Date Thu, 29 Nov 2012 12:19:14 GMT
Addendum at end.

André Warnier wrote:
> Hi.
> I have a problem with a PerlResponseHandler, regarding the character set 
> used in the response to a request.
> Basically, the question is : how do I set the character set properly for 
> the "handle" used in
> $r->print("string") ?
> (where string can be "äéèöü" for example)
> Neither of the following (which I do before starting to print output) 
> seems to work :
> $r->headers_out->unset('content-type');
> $r->headers_out->set('content-type','text/html;charset=xxxx');
> or
> $r->content_type('text/html;charset=xxxx');
> When I say that it doesn't work, I mean in fact :
> - the "Content-Type" response header sent by the server is properly set 
> according to what I do above (as verified in a browser plugin)
> - but if what I print contains "accented" characters, they are not being 
> encoded properly
> So, do I need to set something else so that the $r->print(string) will 
> output "string" properly ?
> Background :
> My PerlResponseHandler reads an HTML file from disk, replaces some 
> strings in it, and sends the result out via $r->print.
> The source html file can be encoded in iso-8859-1 or UTF-8, and it 
> contains a proper declaration of the charset under which it is really 
> encoded :
> <meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
> or
> <meta http-equiv="content-type" content="text/html; charset=UTF-8">
> To read the file, I first open it "raw", read a few lines, checking for 
> the above <meta> tag.  If found, I note the charset (say in $charset), 
> close the file, and re-open it as
> open(my $fh,"<:encoding($charset)", $file);
> (note : if $charset is "UTF-8", then the open becomes
> open(my $fh,'<:utf8', $file);)
> I also at that point set the response charset by one of the means above.
> Then I read the file line by line, substituting some strings in the 
> line, and print out the line via
> $r->print($line);
> etc..
> My problem is that, if the input file is for example iso-8859-1 and 
> contains the word "Männer", the output comes out as "M(A tilde)(some 
> byte)nner" (the bytes corresponding to the UTF-8 encoding of the "a 
> umlaut").
> Can I / should I do something like
> binmode($r,":$charset"); # ??
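The charset detection step described in the quoted message could be sketched roughly as below. This is only an illustration: the helper name detect_charset and the regular expression are assumptions, not the code actually used in the handler, and the sample strings are the two <meta> declarations quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original handler): scan the first raw
# bytes of an HTML file for a <meta ... charset=...> declaration and return
# the charset name, or undef if no declaration is found.
sub detect_charset {
    my ($head) = @_;
    if ($head =~ /<meta\s[^>]*charset\s*=\s*["']?([\w.:-]+)/i) {
        return $1;
    }
    return;
}

my $latin1_meta = '<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">';
my $utf8_meta   = '<meta http-equiv="content-type" content="text/html; charset=UTF-8">';

print detect_charset($latin1_meta), "\n";
print detect_charset($utf8_meta), "\n";
```

The returned name can then be passed to open(my $fh, "<:encoding($charset)", $file). Note that ':encoding(UTF-8)' validates the input, whereas ':utf8' does not.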

Addendum : I added some logging to the ResponseHandler as follows :

PARAM: while (defined($line = <$form_fh>)) {

	if ($Debug > 1) {
		$r->log->warn(" input line is [$line], utf8 flag : " . (Encode::is_utf8($line) ? "y"

The corresponding line in the log, for a line containing the word "männlich", is :

[Thu Nov 29 10:34:37 2012] [warn] [client]  input line is [\t\t\t\t<input name="ANSPR" type="radio" value="m" id="ANSPR">&nbsp;m\xc3\xa4nnlich\n], utf8 flag : y
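A dump of the code points does not depend on how the logfile is written or escaped, so it may be a less ambiguous check than logging the string itself. A minimal sketch, assuming a hypothetical helper named codepoints (not part of the handler):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: render every character of a string as U+XXXX, so the
# result is plain ASCII whatever the encoding of the log.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

my $bytes = "m\xc3\xa4n";              # undecoded UTF-8 bytes
my $chars = decode('UTF-8', $bytes);   # decoded character string

print "bytes: ", codepoints($bytes), "\n";
print "chars: ", codepoints($chars), "\n";
```

A correctly decoded string shows a single U+00E4 for the "a umlaut", while undecoded bytes show U+00C3 U+00A4.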

Of course, as is usual in this type of case, one never knows how the logfile itself is 
encoded.
But it does confirm that, as read in the Handler, the string is properly encoded 
internally in Perl, with the utf8 flag set.
However, when I look in the result as received by the browser,
- the browser says that the page received is encoded as iso-8859-1
- the browser's "view page source" confirms that this character is (incorrectly) 
represented by 2 bytes :
	<input name="ANSPR" type="radio" value="m" id="ANSPR">&nbsp;männlich
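The symptom can be reproduced outside mod_perl by comparing the byte strings that result from encoding the same character string to each charset. The sketch below only uses the Encode module. Whether an explicit encode() before $r->print is the right fix for this handler is an assumption, not something confirmed in this thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they would sit in an iso-8859-1 source file ("männlich").
my $raw = "m\xE4nnlich";

# Reading through an :encoding(iso-8859-1) layer yields a character string;
# decode() does the same for an in-memory buffer.
my $line = decode('iso-8859-1', $raw);

# Encode explicitly to the charset announced in the Content-Type header.
my $as_latin1 = encode('iso-8859-1', $line);   # a-umlaut becomes 1 byte
my $as_utf8   = encode('UTF-8',      $line);   # a-umlaut becomes 2 bytes

printf "characters: %d, latin1 bytes: %d, utf8 bytes: %d\n",
    length($line), length($as_latin1), length($as_utf8);

# Assumption, not confirmed in this thread: inside the handler the
# equivalent call would be  $r->print(encode($charset, $line));
```

If the browser is told iso-8859-1 but receives the two-byte form, it displays exactly the "A tilde" plus a second character described above.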
