perl-asp mailing list archives

From Arnon Weinberg <ar...@back2front.ca>
Subject Re: Output character encoding
Date Fri, 15 Jun 2012 04:34:16 GMT

Thanks very much, Josh, for investigating this - it saved me some time 
narrowing down the issue. Even so, I spent quite a lot of time working 
out a solution for my needs, and I don't think it is generalizable 
as-is. However, in case someone else wants to give it a crack, I provide 
details below.

On 2012-06-05 19:30, Josh Chamas wrote:
> doing this is where we have a problem:
>
> <% print Encode::decode('ISO-8859-1',"\xE2"); %>
>
> and immediately in the Apache::ASP::Response::Write() method the data 
> has already been converted incorrectly

The fact that such a simple use of Encode causes an issue is a little 
surprising. Surely others are using Apache::ASP in multi-language 
environments - is no one using Encode this way? How are others coping 
with this limitation right now?

> Its as if by merely going through the tied interface that data goes 
> through some conversion process.

Not quite, as the same results happen without a tie'd interface. The 
"use bytes" pragma is what causes the conversion (see test script below).

> Apache::ASP::Response does a "use bytes" which is to deal with the 
> output stream correctly I believe this is around content length 
> calculations.
> I think this is fine here, and turning this off makes things worse for 
> these examples.

It looks like the documentation for "use bytes" now strongly discourages 
it for anything other than debugging, so it should indeed be removed. 
The documentation doesn't mention any trivial substitute. However, this 
pragma mostly just overrides some built-in functions with byte-oriented 
versions. So I made the following changes to Response.pm:
- changed "use bytes" => "no bytes" (this still loads the bytes:: 
namespace, but without enabling byte semantics)
- changed all occurrences of length() => bytes::length()
This resolved the mixed-encoding issue originally posted, but introduced 
a new (more manageable) issue.
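
To illustrate the difference between the two length functions, here is a minimal standalone sketch. It is not taken from Response.pm, and the lengths() helper is a hypothetical name used only for this example. It loads the bytes:: namespace without turning on byte semantics and compares length() with bytes::length() on a decoded string:

```perl
use strict;
use warnings;
use Encode ();
use bytes ();    # load the bytes:: namespace only; byte semantics stay off

# Hypothetical helper: character length and internal byte length of a string.
sub lengths {
    my ($string) = @_;
    return ( length($string), bytes::length($string) );
}

# One character (U+00E2), held internally as two UTF-8 bytes.
my $decoded = Encode::decode( 'ISO-8859-1', "\xE2" );
my ( $chars, $bytes ) = lengths($decoded);
print "characters: $chars, internal bytes: $bytes\n";
# prints "characters: 1, internal bytes: 2"
```

Note that bytes::length() counts the bytes of Perl's internal representation, which only matches the number of bytes sent to the client if the output is actually UTF-8. Measuring an explicitly encoded copy of the string is probably the more robust choice for content-length calculations.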

For debugging purposes, I peeked at the "UTF-8 flag" (Perl's internal 
flag that marks a string as stored internally in UTF-8, as decoded 
character strings normally are). This flag should be transparent in 
principle, but it helped make sense of the behaviour of Apache::ASP.
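
For anyone who wants to peek at the flag themselves, a small sketch using the built-in utf8::is_utf8() follows. The flag_state() helper is a hypothetical name for this example only. It also shows why the flag should be transparent: the two strings compare equal even though their flags differ.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: report whether Perl's internal UTF-8 flag is set.
sub flag_state {
    my ($string) = @_;
    return utf8::is_utf8($string) ? 'on' : 'off';
}

my $raw     = "\xE2";                                   # plain literal
my $decoded = Encode::decode( 'ISO-8859-1', "\xE2" );   # decoded character string

printf "raw:     flag %s\n", flag_state($raw);       # flag off
printf "decoded: flag %s\n", flag_state($decoded);   # flag on
print "the two strings compare equal\n" if $raw eq $decoded;
```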
Results of testing are summarized as follows:

1. Testing Perl/CGI, asp-perl, and Apache::ASP, all 3 give the same 
results with the "use bytes" pragma turned on:
- For any string with the UTF-8 flag off, output is correctly encoded.
- Any string with the flag on is (double-)encoded as UTF-8, regardless 
of the actual output encoding.
2. Testing Perl/CGI and asp-perl with "no bytes" produces correct results:
- The UTF-8 flag does not affect output - it is correctly encoded in 
every case.
- However, an interesting test case is that of the double-encoding 
problem (see http://ahinea.com/en/tech/perl-unicode-struggle.html). This 
case is indicative of bad code, so is not a concern here, but it 
illustrates how a tie'd filehandle differs from plain STDOUT. In this 
case, a single "wide character" double-encodes the entire output (with 
buffering on, this can be the entire page), instead of just the string.
- These test cases are demonstrated by the script below.
3. Testing Apache::ASP with "no bytes" produces different results from 
the command-line (asp-perl) version, as well as different results from 
Perl/CGI running on Apache. This suggests an interaction effect between 
Apache and Apache::ASP (both are required to produce these results).
- With the UTF-8 flag off, output is correctly encoded as before.
- However, with "no bytes", Apache::ASP, and the UTF-8 flag on, the 
entire output is double-encoded. This result is similar to the 
double-encoding problem in the previous test case, except that it 
doesn't require a "wide character" - any string with the UTF-8 flag on 
will do.
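
To make "double-encoded" concrete, the sketch below encodes the same character to UTF-8 once and then (wrongly) a second time, and prints the resulting octets in hex. It is only an illustration of the symptom, not of the code path inside Apache or Apache::ASP; hex_octets() is a hypothetical helper.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: hex dump of the octets in a byte string.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

my $char  = Encode::decode( 'ISO-8859-1', "\xE2" );   # U+00E2
my $once  = Encode::encode( 'UTF-8', $char );         # correct UTF-8 octets
my $twice = Encode::encode( 'UTF-8', $once );         # octets re-read as Latin-1 characters

print "encoded once:  ", hex_octets($once),  "\n";    # C3 A2
print "encoded twice: ", hex_octets($twice), "\n";    # C3 83 C2 A2
```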

This test script demonstrates all but the last test case:

#!/usr/bin/perl

use Encode;

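foreach ( "STDOUT", "tie_use_bytes", "tie_no_bytes" )
{
    print "$_: ";
    my $tied = ! /^S/;    # every case except plain STDOUT uses a tie'd handle
    my $STDOUT;
    if ( $tied )
    {
        tie *FH, $_;
        $STDOUT = select ( FH );    # make FH the default output handle
    }
    print "\x{263a}",                            # wide character
        Encode::decode('ISO-8859-1',"\xE2"),     # UTF-8 flag on
        "\xE2";                                  # UTF-8 flag off
    print "\n";
    if ( $tied )
    {
        close ( FH );    # CLOSE writes the buffered output to the real STDOUT
        select ( $STDOUT );
    }
}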

use strict;

package tie_use_bytes;
use bytes;

sub TIEHANDLE { bless {}, shift; }
sub PRINT { shift()->{out} .= join ( $,, @_ ); }
sub CLOSE { print STDOUT delete ( shift()->{out} ); }

package tie_no_bytes;
no bytes;

sub TIEHANDLE { bless {}, shift; }
sub PRINT { shift()->{out} .= join ( $,, @_ ); }
sub CLOSE { print STDOUT delete ( shift()->{out} ); }

# Output: ##################

Wide character in print at ...
STDOUT: ☺ââ # STDOUT output is correct in all cases
tie_use_bytes: ☺ââ # with "use bytes", the UTF-8-flagged 2nd character
# is double-encoded
Wide character in print at ...
tie_no_bytes: ☺ââ # with "no bytes", the output is correct, but a
# "wide character" double-encodes the entire string because of the way
# the tie'd file handle is implemented

#########################

By the way, if it's getting difficult to wrap your head around this, 
you're not alone.

At this point, I peeked at the $Response->{out} data buffer, and could 
see that it was encoded correctly. However, the output from Apache (when 
the UTF-8 flag is on) was not correct, suggesting that Apache is doing 
something to encode the string in this case.
I decided therefore to address the problem by turning off the UTF-8 
flag. The most fault-tolerant method I managed to come up with to do 
this was the following:

${$Response->{BinaryRef}}
    = Encode::encode ( 'ISO-8859-1', ${$Response->{BinaryRef}},
        sub { Encode::encode ( 'UTF-8', chr ( shift() ) ) } )
    if ! grep ( /^utf8$/, PerlIO::get_layers ( STDOUT ) );

which can go at the top of the $Response->Flush() method, or in 
global.asa/Script_OnFlush().
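
The same expression can be tried outside Apache::ASP on an ordinary string. The sketch below is only an approximation of the fix above: to_latin1_octets() and stdout_has_utf8_layer() are hypothetical helper names, and a plain scalar stands in for ${$Response->{BinaryRef}}. Characters that ISO-8859-1 can represent become single bytes, and anything else is passed to the fallback sub, which substitutes that character's UTF-8 bytes. Either way the result is a byte string with the UTF-8 flag off.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: the encode-with-fallback expression applied to a string.
sub to_latin1_octets {
    my ($string) = @_;
    return Encode::encode( 'ISO-8859-1', $string,
        sub { Encode::encode( 'UTF-8', chr( shift() ) ) } );
}

# Hypothetical helper: true if STDOUT already has a :utf8 layer, in which
# case the conversion is skipped, as in the guard above.
sub stdout_has_utf8_layer {
    return scalar grep { $_ eq 'utf8' } PerlIO::get_layers(*STDOUT);
}

# A wide character followed by a Latin-1 character with the UTF-8 flag on.
my $mixed  = "\x{263a}" . Encode::decode( 'ISO-8859-1', "\xE2" );
my $octets = stdout_has_utf8_layer() ? $mixed : to_latin1_octets($mixed);

printf "flag before: %s\n", utf8::is_utf8($mixed)  ? 'on' : 'off';
printf "flag after:  %s\n", utf8::is_utf8($octets) ? 'on' : 'off';
print join( ' ', map { sprintf '%02X', ord($_) } split //, $octets ), "\n";
```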

With this solution I can now modify Apache::ASP's output encoding (e.g., 
using binmode ( STDOUT );), as originally desired, and the output 
appears correct in all my test cases.


-- 
-------------------------------------------------------------------------------
Arnon Weinberg
www.back2front.ca



