perl-modperl mailing list archives

From André Warnier ...@ice-sa.com>
Subject Re: Charset in response
Date Thu, 29 Nov 2012 16:10:25 GMT
André Warnier wrote:
> André Warnier wrote:
>> Torsten Förtsch wrote:
>>> On 11/29/2012 10:37 AM, André Warnier wrote:
>>>> When I say that it doesn't work, I mean in fact :
>>>> - the "Content-Type" response header sent by the server is properly set
>>>> according to what I do above (as verified in a browser plugin)
>>>> - but if what I print contains "accented" characters, they are not
>>>> being encoded properly
>>>>
>>>> So, do I need to set something else so that the $r->print(string) will
>>>> output "string" properly ?
>>>>
>>>>
>>>> Background :
>>>>
>>>> My PerlResponseHandler reads a html file from disk, replaces some
>>>> strings into it, and sends the result out via $r->print.
>>>> The source html file can be encoded in iso-8859-1 or UTF-8, and it
>>>> contains a proper declaration of the charset under which it is really
>>>> encoded :
>>>>
>>>> <meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
>>>> or
>>>> <meta http-equiv="content-type" content="text/html; charset=UTF-8">
>>>>
>>>> To read the file, I first open it "raw", read a few lines, checking for
>>>> the above <meta> tag.  If found, I note the charset (say in $charset),
>>>> close the file, and re-open it as
>>>>
>>>> open(my $fh,"<:encoding($charset)", $file);
>>>>
>>>> (note : if $charset is "UTF-8", then the open becomes
>>>> open(my $fh,'<:utf8', $file);)
>>>
>>> So, you convert the octet stream into a character stream when you read
>>> the file. You have to do the reverse when you write it.
>>
>> I have to, to be able to be consistent in my string-replacement logic.
>>
>>>
>>>   $r->print(Encode::encode $encoding, $string);
>>>
>>> Modperl usually uses perlio. So, perl-script handler should be able to
>>> push an encoding layer on top of the :Apache2 layer.
>>>
>>>   binmode STDOUT, ':encoding(...)'
>>>
>>> But I haven't tried this yet.
>>>
>>> Now, that I think of it, perhaps even the following would work
>>>
>>>   open my $fh, '>:Apache2:encoding(...)', $r;
>>>   print $fh $string;
>>>
>>> If it does not work it would be good to make it so.
>>>
>>
>> I'll try the above and let you know.
>>
>> I guess that if I can do
>> open my $fh, '>:Apache2:encoding(...)', $r;
>> then $r, under the hood, must be some kind of filehandle too.
>> And then I could just do
>> binmode($r,":encoding($charset)");
>> but then, this being mod_perl, it may leave it that way and have 
>> unexpected side-effects somewhere else..
>>
>>
> 
> Results :
> 
> 1) using : open my $fh, '>:Apache2:encoding(...)', $r;
> 
> (Note: I can't find Apache2::encoding anywhere.  Was that a typo ?)
> 
>     $logger->warn("$pfx: reading form using encoding [$enc]") if $Debug>1;
> ...
>     my $response_fh;
>     unless (open ($response_fh,">:$enc",$r)) {
>         $logger->error("$pfx Cannot open \$r : $?");
>         return Apache2::Const::SERVER_ERROR;
>     }
> 
> brings server error and logs :
> 
> [Thu Nov 29 15:48:42 2012] [warn] [client 192.168.245.129] 
> AM::SendForm::response: reading form using encoding [encoding(iso-8859-1)]
> [Thu Nov 29 15:48:42 2012] [error] [client 192.168.245.129] 
> AM::SendForm::response Cannot open $r : 0
> 
> 2) using : binmode STDOUT, ':encoding(...)'
> 
>     $logger->warn("$pfx: reading form using encoding [$enc]") if $Debug>1;
> ...
>     binmode(STDOUT,":$enc");
> ...
>         $logger->warn(" input line is [$line], utf8 flag : " . 
> (Encode::is_utf8($line) ? "y" : "n"));
> ...
>     $r->print($line);
> ...
> 
> does not bring server error and outputs the page, but apparently has no 
> effect (characters are still wrong) :
> 
> [Thu Nov 29 15:55:52 2012] [warn] [client 192.168.245.129]  input line 
> is [\t\t\t\t<input name="ANSPR" type="radio" value="m" 
> id="ANSPR">&nbsp;m\xc3\xa4nnlich\n], utf8 flag : y
> 
> (in the response also)
> 
> 3) same as (2), but using simple "print $line;" instead of 
> "$r->print($line);"
> 
> That is very bizarre.  It runs through the code for many lines.  It 
> still prints the one "männlich" line wrong (in the log and in the html 
> output as well):
> client 192.168.245.129]  input line is [\t\t\t\t<input name="ANSPR" 
> type="radio" value="m" id="ANSPR">&nbsp;m\xc3\xa4nnlich\n], utf8 flag : y
> 
> but now in addition, it crashes a few lines further with a server error 
> and this in the log :
> 
> [Thu Nov 29 16:01:45 2012] [warn] [client 192.168.245.129]  input line 
> is [<tr><td>&nbsp;</td></tr>\n], utf8 flag : y
> [Thu Nov 29 16:01:45 2012] [error] [client 192.168.245.129] "\\x{4bae}" 
> does not map to iso-8859-1 at 
> /usr/local/lib/apache2/perllib/AM/SendForm.pm line 203, <$form_fh> line 
> 101.\n
> 
> The line 101 of the input form is as shown in the log just before the 
> error :
> <tr><td>&nbsp;</td></tr>
> 
> and the next line is a simple
> <tr>
> 
> I have examined the form with a UTF-8 capable editor, and I see no 
> strange extra characters anywhere near it. I have no idea where this 
> "\\x{4bae}" could be coming from.
> 
> 4) trying : $r->print(Encode::encode $encoding, $string);
> 
> as : $r->print(Encode::encode($charset,$line));
> 
> Bingo !
> 
> It still prints in the log :
> [Thu Nov 29 16:21:42 2012] [warn] [client 192.168.245.129]  input line 
> is [\t\t\t\t<input name="ANSPR" type="radio" value="m" 
> id="ANSPR">&nbsp;m\xc3\xa4nnlich\n], utf8 flag : y
> 
> But it outputs it correctly in the response document sent to the browser :
>                 <input name="ANSPR" type="radio" value="m" 
> id="ANSPR">&nbsp;männlich
> 
> and it also doesn't choke on the line on which it choked before :
> [Thu Nov 29 16:21:42 2012] [warn] [client 192.168.245.129]  input line 
> is [<tr><td>&nbsp;</td></tr>\n], utf8 flag : y
> [Thu Nov 29 16:21:42 2012] [warn] [client 192.168.245.129]  input line 
> is [<tr>\n], utf8 flag : y
> 
> This works, but does not seem to be very efficient: it makes an 
> additional function call for each output line.
> I don't know, though, how this compares to having perlio encode 
> the output.
> 

Addendum : the forms are not that big, and the calls to forms are not that frequent, so I
can perfectly well live with this solution.
I would still like to know why none of the other methods work, though.
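
For reference, the approach that works (decode on input, substitute on characters, encode explicitly on output) can be sketched outside of mod_perl as below. This is only a minimal sketch under stated assumptions, not the actual handler code: the helper names detect_charset and render, the {{name}} placeholder syntax, and the fallback to iso-8859-1 when no <meta> tag is found are all hypothetical, and the input is read from an in-memory scalar rather than from a file on disk.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: sniff the declared charset from the raw octets.
# Falls back to iso-8859-1 (an assumption) when no <meta> charset is found.
sub detect_charset {
    my ($raw) = @_;
    return $raw =~ /<meta[^>]+charset=["']?([\w-]+)/i ? $1 : 'iso-8859-1';
}

# Hypothetical helper: octets in, octets out, characters in between.
sub render {
    my ($raw, %subst) = @_;
    my $charset = detect_charset($raw);

    # Same idea as open(my $fh, "<:encoding($charset)", $file), but reading
    # from an in-memory scalar so that the sketch needs no file on disk.
    open(my $fh, "<:encoding($charset)", \$raw) or die "cannot open: $!";
    my $text = do { local $/; <$fh> };   # slurp as a character string
    close $fh;

    # String replacement happens on characters, not on octets.
    # {{name}} is a made-up placeholder syntax for this sketch.
    $text =~ s/\{\{(\w+)\}\}/exists $subst{$1} ? $subst{$1} : ''/ge;

    # Encode once for the whole document instead of once per line.
    # FB_CROAK dies on characters that do not map to $charset;
    # LEAVE_SRC keeps Encode from modifying $text in place.
    my $octets = Encode::encode($charset, $text,
                                Encode::FB_CROAK() | Encode::LEAVE_SRC());
    return ($charset, $octets);
}

my $src = qq{<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">\n}
        . "&nbsp;m\xe4nnlich {{name}}\n";

my ($charset, $octets) = render($src, name => "J\x{fc}rgen");

# In the response handler, $octets is what would be passed to $r->print().
print length($octets), " octets to send as $charset\n";
```

Encoding the whole document in one call also avoids the per-line call to Encode::encode that I was worried about above, at the cost of holding the complete page in memory, which should not matter for forms of this size.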

One other thing which puzzles me : the "read-and-replace" part of the code of this 
ResponseHandler is essentially extracted from the code of a previous cgi-bin script, which 
did the same. In that cgi-bin script, I was doing this :

my $cgi = CGI->new();
...
then reading the file the same way and substituting strings the same way, then
..
$cgi->print($line);

and there was never a charset problem on output.
So why is there one here ?
