perl-embperl mailing list archives

From "Hartmaier Alexander" <Alexander.Hartma...@t-systems.at>
Subject AW: Little Embperl UTF-8 HOWTO
Date Thu, 25 Nov 2004 17:22:15 GMT
Hi!

Superb Torsten!

I played around with UTF-8 last week and it was way less hassle than I thought ;-)

My environment looks like this:

LANG=de_AT.UTF-8@euro
LD_LIBRARY_PATH=/u01/app/oracle/product/10.1.0/client_1/lib
NLS_LANG=AMERICAN_AMERICA.AL32UTF8
ORACLE_BASE=/u01/app/oracle
ORACLE_HOME=/u01/app/oracle/product/10.1.0/client_1
ORA_NLS33=/u01/app/oracle/product/10.1.0/client_1/nls/data


I hadn't used steps 3 and 4 until now, and have just switched to using 3 as well.

@5: I use Oracle 10g, DBD-Oracle 1.16 and DBI 1.46, and UTF-8 support works out of the box.
My Oracle is configured to use AMERICAN_AMERICA.AL32UTF8.
My Oracle even uses 4 bytes for 1 char (in the 'worst' case).

@7: With my setup I didn't need to convert anything to make UTF-8 work. Oracle passes UTF-8,
Embperl doesn't touch it (with epchar.c.min) and the browser selects the right encoding.
My default header for every page looks like this:

[$ sub page_xhtml$]
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<meta http-equiv="Content-type" content="text/html; charset=UTF-8" />
[$ endsub $]

The <meta> tag is needed by IE to choose UTF-8 encoding (maybe not if 'AddDefaultCharset
UTF-8' is configured in apache2).

@8: I also use utf8::decode() for %fdat (but only for the values that need to be converted).
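
Decoding only selected form fields could look roughly like the sketch below. It is a minimal illustration, not code from my application: decode_fields is a hypothetical helper name, and a plain hash stands in for Embperl's %fdat so the snippet runs outside Embperl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Embperl): decode only the named
# fields of a form-data hash, skipping undefined or already-flagged values.
sub decode_fields {
    my ($fdat, @names) = @_;
    for my $name (@names) {
        next unless defined $fdat->{$name};
        next if utf8::is_utf8($fdat->{$name});
        utf8::decode($fdat->{$name});    # decodes in place
    }
    return;
}

# Plain hash standing in for %fdat; values are raw bytes as they
# would arrive from the browser.
my %fdat = (
    name   => "J\xC3\xBCrgen",    # UTF-8 bytes for "Jürgen"
    upload => "\xFF\xFE\x00",     # binary data that must stay untouched
);

decode_fields(\%fdat, 'name');

print length($fdat{name}),   "\n";    # prints 6 (characters)
print length($fdat{upload}), "\n";    # prints 3 (bytes, unchanged)
```

Naming the fields explicitly keeps binary values such as uploads out of the decoder.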

Thanks for the mini-howto, Torsten. @Gerald: maybe you can add it to the Embperl documentation
as well.

Alex


-----Original Message-----
From: Torsten Luettgert [mailto:t.luettgert@pressestimmen.de]
Sent: Thursday, 25 November 2004 17:49
To: embperl@perl.apache.org
Subject: Little Embperl UTF-8 HOWTO

Hi,

in contrast to my usual reasons for posting, this mail is not for
whining about some Embperl bug, but because I wanted to wrap up what one
has to do in order to use Embperl in database-driven web applications
with UTF-8. Perhaps it'll help someone. It certainly took me a while to
figure all this out.


1. Versions
Use the latest Embperl release. ATM this is Embperl-2.0rc2.
You also need a recent perl. It should work from 5.8.1 on (because that
one has "use utf8", see below), but I only know for sure that 5.8.3
works.


2. Embperl compilation
Before doing the usual "perl Makefile.PL; make; make install UNINST=1",
copy the file epchar.c.min over epchar.c.
If you don't, special characters are escaped in some cases and
you see garbage instead of the special character.


3. Configure your apache (I just assume you're using apache 2.0.x) to
use UTF-8 as the default character set with the directive

AddDefaultCharset UTF-8


4. Source code in UTF-8
Write all of your perl files in UTF-8. If you're using vim, you can
convert your source files by opening them, doing ":se fileencoding=utf8"
and saving them again.
You should also tell Perl that the source is UTF-8 with the line

use utf8;

Regrettably, that pragma only marks the current lexical scope as being
UTF-8. I had to insert [- use utf8; -] at the beginning of each and
every Embperl source file I have. A bit inelegant, but I didn't find
a better solution (why it is important for Perl to know that input is
UTF-8 is explained in 6. "the utf-8 flag").


5. Your database
One could probably write big books about databases and UTF-8. I used
PostgreSQL, mysql and DB2. mysql and PostgreSQL allow you to set an encoding
at database creation time:

createdb --encoding=utf8 dbname             (pg)
CREATE DATABASE dbname CHARACTER SET utf8;  (mysql)
DB2 is blissfully utf-8-unaware.

Be warned: Except for PostgreSQL, if you have a
CREATE TABLE test (
  ministr varchar(2)
);
you can store 2 BYTES in ministr, NOT 2 characters! Meaning,
you could store 'ab', 'ä', but not 'äh'. And the euro symbol '€' is
too big for ministr!
PostgreSQL allows 2 characters, so even '€€' would fit.
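
The byte-versus-character arithmetic above can be checked from Perl. This is a minimal sketch, assuming the column limit counts UTF-8 bytes; utf8_byte_length is a hypothetical helper name, and the strings are written with escapes so the result does not depend on the encoding of the script file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: number of bytes a character string occupies
# once it is encoded as UTF-8.
sub utf8_byte_length {
    my ($chars) = @_;
    return length(encode_utf8($chars));
}

binmode(STDOUT, ':encoding(UTF-8)');

# "ab", "ä", "äh", "€", "€€"
for my $s ("ab", "\x{E4}", "\x{E4}h", "\x{20AC}", "\x{20AC}\x{20AC}") {
    printf "%s: %d chars, %d bytes\n", $s, length($s), utf8_byte_length($s);
}
```

With a 2-byte limit, only the first two strings fit; "äh" and "€" need 3 bytes each, and "€€" needs 6.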


6. The utf-8 flag
Perl now distinguishes between characters and bytes (which makes sense
considering that "€" is 3 bytes, but 1 char). It also attaches a flag to
every string, the utf-8 flag, which tells if the string is in Perl's
"internal format" (which is UTF-8).
You REALLY want this flag set on all of your strings that are not pure
ASCII, because strings are not 'equal' if the flag value differs, even
if the string without utf-8 flag contains the same bytes.
Literal strings have this flag automatically set if you did "use utf8;".

If you're not sure if a string has the flag, use the DBI function
"data_string_desc". Here's an example:

#!/usr/bin/perl
$a = "äöü";
use utf8;
$b = "äöü";
if( $a eq $b ) {
  print "Strings are equal.\n";
}else{
  print "Strings are NOT equal.\n";
}
use DBI;
print "a: ".DBI::data_string_desc($a)."\n";
print "b: ".DBI::data_string_desc($b)."\n";

This prints the following:

Strings are NOT equal.
a: UTF8 off, non-ASCII, 6 characters 6 bytes
b: UTF8 on, non-ASCII, 3 characters 6 bytes

Which is pretty much self-explanatory.
Strings are converted into the internal format with

utf8::decode($string);

Yes, that's decode, not encode, because from Perl's point of
view, the character ENCODING (even if it is the same as Perl uses
internally) is undone and it is converted to the internal format.
Oh, and a little caveat: only decode strings ONCE. Everything else
is asking for trouble. To be sure about that, use

utf8::decode($string) unless utf8::is_utf8($string);
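
The decode-once guard can be wrapped in a small function. This is a sketch only; decode_once is a hypothetical name, and it takes a reference so that the caller's string is changed in place, as utf8::decode itself does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: decode the referenced string in place, but only
# if it is defined and does not carry the utf-8 flag yet.
sub decode_once {
    my ($ref) = @_;
    return if !defined $$ref || utf8::is_utf8($$ref);
    utf8::decode($$ref);
    return;
}

my $s = "\xE2\x82\xAC";    # the three UTF-8 bytes of the euro sign
decode_once(\$s);          # decodes: now 1 character
decode_once(\$s);          # flag already set, so nothing happens
print length($s), "\n";    # prints 1
```

Pure ASCII strings never get the flag, so the guard lets them through to utf8::decode every time, which is harmless because decoding ASCII changes nothing.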

If you need to convert strings from other encodings, use the Encode
module. Example for latin-1:

use Encode;
$internal_string = Encode::decode('iso-8859-1', $string);


7. DBI and the utf-8 flag
If you're connecting to a database, you probably use the DBI module
(and if you don't, you probably should :-)
Strings coming from DBI do NOT, for the time being, have the utf-8 flag
set (at least in DBI version 1.46, which I'm using). It is on their TODO
list, though.

Since you really want that flag on, as explained in 6., you need to
convert every string from the database to Perl's internal format with
the flag on (if it's not pure ASCII). I was lucky in this, because I use
the same wrapper functions for SQL access everywhere, and only had to
add utf8::decode() to them (as explained in 6.)
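
Such a wrapper might post-process fetched rows along these lines. This is an illustration under assumptions, not my actual wrapper code: decode_rows is a hypothetical name, and a hand-built array of array refs (the shape returned by DBI's selectall_arrayref) stands in for a real query so the snippet runs without a database.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: decode every defined, not-yet-flagged column of
# every row in place. Expects an array ref of array refs.
sub decode_rows {
    my ($rows) = @_;
    for my $row (@$rows) {
        for my $col (@$row) {    # $col aliases the array element
            next if !defined $col || utf8::is_utf8($col);
            utf8::decode($col);
        }
    }
    return $rows;
}

# Stand-in for: my $rows = $dbh->selectall_arrayref($sql);
my $rows = [
    [ "1", "Z\xC3\xBCrich" ],    # UTF-8 bytes for "Zürich"
    [ "2", undef ],              # NULL column
];

decode_rows($rows);
print length($rows->[0][1]), "\n";    # prints 6
```

The is_utf8 check means the wrapper stays safe if a later DBI or driver version starts returning flagged strings by itself.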


8. Embperl and the utf-8 flag
Strings stored to %udat keep their utf-8 flag, so you don't need to
worry about that.
%fdat is different, however (Gerald has it on the post-2.0 TODO list).
What I said in 7. applies here as well. %fdat can be converted by doing

foreach my $k (keys %fdat) {
  utf8::decode($fdat{$k});
}

That's about all I found out about on my journey to Embperl with UTF-8.
For more documentation, see

perldoc utf8
perldoc Encode
perldoc perlunicode

Greetings,
Torsten

