cocoon-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject [bug] encoding problems
Date Sat, 23 Feb 2002 15:00:56 GMT
[cross posted because people on the cocoon list might hit this as well]

I've always tested xindice with English documents, so I didn't notice
this behavior until today, when I imported an Italian XML document.

The document is encoded using UTF-8 and looks like this:

 <?xml version="1.0" encoding="UTF-8"?>
 ...
  <subtitle>
   In sempre piÃ¹ film il computer con la Mela Ã¨ l'arma 
   dei giusti contro criminali di ogni specie che invece 
   preferiscono i pc
  </subtitle>
 ...

[this is a news document taken from an Italian on-line newspaper]

 Ã¹ -> ù
 Ã¨ -> è

are the two unicode translations for the non-ASCII characters (since
UTF-8 is backward compatible with ASCII, you don't notice any difference
until you use non-ASCII letters such as these).
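
The byte values involved can be checked with a minimal Python sketch. It
is only an illustration: the helper name utf8_view is made up here, and
ISO-8859-1 is assumed as the single-byte charset used to misread the
bytes.

```python
def utf8_view(ch):
    """Return the UTF-8 bytes of ch as hex, plus how they look as ISO-8859-1."""
    raw = ch.encode("utf-8")
    as_hex = " ".join(f"{b:02x}" for b in raw)
    # Assumption: the viewer treats every byte as one ISO-8859-1 character.
    return as_hex, raw.decode("latin-1")

for ch in ("ù", "è"):
    print(ch, *utf8_view(ch))

# Plain ASCII text has identical bytes in ASCII and UTF-8,
# which is why English documents never show the problem.
print("film".encode("utf-8") == "film".encode("ascii"))
```

Each accented letter takes two bytes in UTF-8, and a viewer that reads
them one byte at a time shows two characters instead of one.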

Opening the document in Explorer or XML-Spy yields the correct
characters.

Then, when I import it into the database and access it through the cocoon
XML:DB source, I get (in the Explorer window):

  <?xml version="1.0" encoding="UTF-8" ?> 
   ...
  <subtitle>
   In sempre piÃ¹ film il computer con la Mela Ã¨ l'arma dei giusti 
   contro criminali di ogni specie che invece preferiscono i pc
  </subtitle> 

I get the same thing when opening the source in the notepad window. But
in win2k notepad is UNICODE-aware... so I saved the source to disk and
opened it with UltraEdit (which is UNICODE-aware but has a nice binary
view) and voilà:

  ...
  <subtitle>
   In sempre piÃƒÂ¹ film il computer con la Mela ÃƒÂ¨ 
   l'arma dei giusti contro criminali di ogni specie 
   che invece preferiscono i pc
  </subtitle>
  ...

where I believe that

 Ãƒ -> Ã
 Â¹ -> ¹

This similarity in encoding probably shows why nobody noticed this
before.
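
A second round of UTF-8 encoding would produce exactly this kind of
pattern. Here is a minimal sketch, assuming the first pass misread the
bytes as ISO-8859-1 and that the binary view renders bytes as
windows-1252 (both are assumptions, and double_encode is a name made up
for illustration):

```python
def double_encode(text):
    """Encode to UTF-8, misread the bytes as ISO-8859-1, then encode again."""
    return text.encode("utf-8").decode("latin-1").encode("utf-8")

stored = double_encode("ù")

# Raw bytes that end up stored: four bytes instead of two.
print(stored.hex())

# What a UTF-8 aware viewer (browser, notepad) shows.
print(stored.decode("utf-8"))

# What a byte-oriented view shows, assuming windows-1252 rendering.
print(stored.decode("cp1252"))
```

The UTF-8 aware view of the doubly encoded data looks just like a
byte-oriented view of the correct data, which fits the similarity noted
above.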

So I went directly into the news.tbl and got the same bytes:

   n sempre piÃƒÂ¹ film il compu
   ter con la Mela ÃƒÂ¨ l'arma d
   ei giusti 

which clearly indicates that the 'xindice' command line import tool is
somehow ignoring the declared 'UTF-8' encoding and performing UTF-8
encoding on something that is *already* UTF-8 encoded.
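
To make the suspected failure mode concrete, here is a hedged Python
sketch. It does not reproduce the actual import code, which I have not
inspected; the function names are hypothetical and ISO-8859-1 is only an
assumed stand-in for whatever charset the tool falls back to.

```python
DOC = ('<?xml version="1.0" encoding="UTF-8"?>'
       '<subtitle>In sempre più film</subtitle>')
FILE_BYTES = DOC.encode("utf-8")  # the bytes as they sit on disk

def import_honoring_declaration(file_bytes, declared="utf-8"):
    """Decode with the declared encoding, then store as UTF-8."""
    return file_bytes.decode(declared).encode("utf-8")

def import_ignoring_declaration(file_bytes, assumed="latin-1"):
    """Hypothetical buggy path: decode with a single-byte charset,
    then store as UTF-8, which encodes every non-ASCII byte a second time."""
    return file_bytes.decode(assumed).encode("utf-8")

def undo_double_encoding(stored_bytes):
    """Reverse the buggy path (valid only under the same charset assumption)."""
    return stored_bytes.decode("utf-8").encode("latin-1")

good = import_honoring_declaration(FILE_BYTES)
bad = import_ignoring_declaration(FILE_BYTES)

print(good == FILE_BYTES)                       # stored bytes match the file
print(len(bad) - len(FILE_BYTES))               # extra bytes added by the bug
print(undo_double_encoding(bad) == FILE_BYTES)  # damage is reversible here
```

Under these assumptions the damage is reversible, but only because every
byte has a mapping in ISO-8859-1; a different fallback charset could
lose information.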

My perception is that there is nothing wrong in the way XIndice or
Cocoon get the information *out* of the database: the problem lies in
how the information gets *into* the database.

I would suggest that the XIndice dev community consider this bug a
showstopper for the 1.0 final release.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<stefano@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------