xerces-p-dev mailing list archives

From Jiri Fiser <FiserJ...@seznam.cz>
Subject Bug in UTF-8 output
Date Thu, 30 Aug 2001 23:40:53 GMT
I want to use your XML Perl module (1.5.1) to process XML documents
written in Czech (diploma theses of my students). These documents will be
written in the ISO Latin 2 (Linux) or CP 1250 (Windows) encodings, and I want
to transform them to UTF-8 before processing.

I have tried your XML module, and under standard conditions (Linux RH 7.2,
locale = cs_CZ) everything is OK: all strings were converted from UTF-8 to
ISO Latin 2 (8859-2) without error. Unfortunately, I need the output in UTF-8.
When I tried the locale cs_CZ.utf8 with the utf8 option in Perl 5.6.0, the
output was in UTF-8, but strings containing multibyte (2 bytes for Czech)
UTF-8 characters are shortened at their end, probably by one character for
each multibyte character in the string (i.e. the size of the output string in
bytes is equal to its length in characters).
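
To make the suspected arithmetic concrete, here is a minimal sketch in Python (not Perl, and not the module's actual code). It assumes the bug amounts to using the character count where the byte count is needed; the helper name truncate_to_char_count is hypothetical and exists only for this illustration.

```python
def truncate_to_char_count(text: str) -> bytes:
    """Encode text as UTF-8, then keep only len(text) bytes.

    Assumption for illustration: the character count is used
    where the byte count is needed, so the tail is cut off.
    """
    encoded = text.encode("utf-8")
    return encoded[:len(text)]


word = "lex\u00e9m"                     # "lex[e with acute]m": 5 characters
encoded = word.encode("utf-8")          # 6 bytes, the accented "e" takes 2
clipped = truncate_to_char_count(word)  # only the first 5 bytes survive

print(len(word), len(encoded))          # character count vs. byte count
print(clipped)                          # raw bytes left after the cut
```

With one two-byte character in the word, exactly one trailing byte is lost, which matches the missing final "m" in the attribute name.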

I do not know where the bug is, because I didn't test UTF-8 support in Perl
5.6.0 adequately (core functions such as "length" seem OK; locale-dependent
ones such as "uc" do not).

Can you help me, please?

Jiri Fiser
UJEP University, Pedagogical faculty in Usti nad Labem
Czech Republic

Platform info (perl -V):


Summary of my perl5 (revision 5.0 version 6 subversion 0) configuration:
  Platform:
   osname=linux, osvers=2.2.17-8smp, archname=i386-linux
   uname='linux porky.devel.redhat.com 2.2.17-8smp #1 smp fri nov 17 16:12:17 est 2000 i686 unknow
   config_args='-des -Doptimize=-O2 -march=i386 -mcpu=i686 -Dcc=gcc -Dcccdlflags=-fPIC -Dinstallpr
   hint=recommended, useposix=true, d_sigaction=define
   usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
   useperlio=undef d_sfio=undef uselargefiles=undef
   use64bitint=undef use64bitall=undef uselongdouble=undef usesocks=undef
  Compiler:
   cc='gcc', optimize='-O2 -march=i386 -mcpu=i686', gccversion=2.96 20000731 (Red Hat Linux 7.1 2.
   cppflags='-fno-strict-aliasing'
   ccflags ='-fno-strict-aliasing'
   stdchar='char', d_stdstdio=define, usevfork=false
   intsize=4, longsize=4, ptrsize=4, doublesize=8  
   d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
   ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=4
   alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
   ld='gcc', ldflags =' -L/usr/local/lib'
   libpth=/usr/local/lib /lib /usr/lib  
   libs=-lnsl -ldl -lm -lc -lcrypt
   libc=/lib/libc-2.2.2.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
   dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
   cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Characteristics of this binary (from libperl):
  Compile-time options:
  Built under linux
  Compiled at Mar 23 2001 12:49:50
  @INC:
   /usr/lib/perl5/5.6.0/i386-linux
   /usr/lib/perl5/5.6.0
   /usr/lib/perl5/site_perl/5.6.0/i386-linux
   /usr/lib/perl5/site_perl/5.6.0
   /usr/lib/perl5/site_perl
   .

An example of the bug, produced by the standard sample DOMPrint.pl. The
first occurrence is in the attribute name "lexém" (lex[e with acute]m), where
the final character "m" is omitted in the output:

INPUT:

<?xml version="1.0"?>

<!DOCTYPE slovo SYSTEM "slovnik.dtd">

<slovo lexém="den">
  <gramatika>
   <substantivum vzor="stroj">
     <výjimka pád="6" tvar="dnu"/>
     <výjimka pád="2" číslo="množné" tvar="dní"/>
   </substantivum>
  </gramatika>
  <sémantika>
   <význam semid="24hodin">
     <výklad>časový úsek 24 hodin</výklad>
     <ukázka>
       <text>míjí den po dni</text>
     </ukázka>
   </význam>
   <význam semid="bílý">
     <výklad>doba od východu do západu slunce</výklad>
     <ukázka>
       <pramen>SPJ</pramen>
       <text>dlouhý letní den</text>
     </ukázka>
     <ukázka>
       <text>nechval dne před večerem</text>
     </ukázka>
   </význam>
  </sémantika>
</slovo>

OUTPUT:

<?xml version="1.0"?>
<!DOCTYPE slovo SYSTEM 'slovnik.dtd' >
<slovo lexC)="den">
       <gramatika>
               <substantivum vzor="stroj">
                     <vC=jimk pC!="6" tvar="dnu" DMC-s="jednotnC"/>
                      <vC=jimk pC!="2" tvar="dnC" DMC-s="mnoE>n"/>
           </substantivum>
   </gramatika>
   <sC)mantik>
              <vC=zna semid="24hodin">
           <vC=kla>DMasovC= C:sek 24</vC=kla>
                       <ukC!zk>
                               <text>mC-jC- den po</text>
                      </ukC!zk>
               </vC=zna>
               <vC=zna semid="bC-l">
                      <vC=kla>doba od vC=chodu do zC!padu sl</vC=kla>
                       <ukC!zk>
                               <pramen>SPJ</pramen>
                           <text>dlouhC= letnC-</text>
                      </ukC!zk>
                       <ukC!zk>
                           <text>nechval dne pEYed veDM</text>
                   </ukC!zk>
                  </vC=zna>
       </sC)mantik>
</slovo>



