cocoon-dev mailing list archives

From Stephen Zisk <sz...@mediabridge.net>
Subject Re: [Cocoon Devel] Re: How to determine encoding?
Date Fri, 01 Sep 2000 17:16:37 GMT


> >>Is there a way to determine what encoding was specified for an
> >>XML document after parsing?
>
>Ricardo,
>Did you ever work out how to do this?
>How?
>
>thanks Jeremy


Am I missing something? I would have said that if there is no explicit
encoding information, there is no way to accurately derive the encoding.
The ISO-8859-x character encoding definitions, Windows code pages, and even
UTF-8 all represent the ASCII character complement using the same one-byte
encoding as ASCII itself. So unless you propose matching accented characters
against a known language, how can you distinguish among any of these in a
file when most of the characters are part of the ASCII complement?
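
To illustrate the point, here is a minimal Python sketch (not part of
Cocoon; the sample string and the list of encodings are my own choices for
illustration). It encodes an ASCII-only XML fragment under several encodings
and checks whether the resulting bytes differ:

```python
# Encode an ASCII-only XML fragment under several encodings and compare
# the resulting byte strings. The sample text and encoding list are
# illustrative assumptions, not taken from any real document.
text = '<?xml version="1.0"?><doc>plain ASCII text</doc>'
encodings = ["ascii", "iso-8859-1", "iso-8859-15", "cp1252", "utf-8"]

encoded = {name: text.encode(name) for name in encodings}

# If every encoding produced the same bytes, the set has exactly one member.
all_identical = len(set(encoded.values())) == 1
print(all_identical)
```

When the bytes are identical, nothing in the file itself can tell you which
of these encodings the author had in mind.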

You might have a chance distinguishing UTF-8 from the others by recognizing 
common multi-byte sequences, but for all of the one-byte encodings, most of 
the non-ASCII character codes represent meaningful characters. This is 
especially true for minor variants like ISO-8859-1 vs ISO-8859-15.
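
A rough Python sketch of both halves of that argument, under stated
assumptions: looks_like_utf8 is a hypothetical helper name of mine, the
"strict decode succeeds" test is only a heuristic, and ISO-8859-15 is used
simply as an example of a close variant of ISO-8859-1.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: data has non-ASCII bytes and decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Pure ASCII also decodes as UTF-8, so require at least one high byte.
    return any(b >= 0x80 for b in data)


word = "caf\u00e9"  # "cafe" with an acute accent on the final e
utf8_sample = word.encode("utf-8")         # multi-byte sequence for U+00E9
latin1_sample = word.encode("iso-8859-1")  # single byte 0xE9

utf8_guess = looks_like_utf8(utf8_sample)
latin1_guess = looks_like_utf8(latin1_sample)

# A single high byte is valid in both one-byte encodings, but it means
# something different in each, and neither decode raises an error.
ambiguous = b"\xa4"
as_8859_1 = ambiguous.decode("iso-8859-1")    # U+00A4 CURRENCY SIGN
as_8859_15 = ambiguous.decode("iso-8859-15")  # U+20AC EURO SIGN
```

The UTF-8 check works because arbitrary high bytes rarely form valid
multi-byte sequences. The one-byte encodings give no such signal: both
decodes succeed, and only knowledge of the intended text can pick one.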

Stephen Zisk

----------
Stephen Zisk                      MediaBridge Technologies
email:  szisk@mediabridge.net     100 Nagog Park
tel:    978-795-7040              Acton, MA 01720    USA
fax:    978-795-7100              http://www.mediabridge.net

