cocoon-dev mailing list archives

From Leszek Gawron
Subject Re: svn commit: r111262 - in cocoon/branches/BRANCH_2_1_X/src: java/org/apache/cocoon/components/flow webapp/WEB-INF
Date Thu, 09 Dec 2004 08:49:53 GMT
Bertrand Delacretaz wrote:
> On 9 Dec 04, at 09:21, Leszek Gawron wrote:
>> ...By the way: it is a little bit different on win32. Some tools 
>> detect utf encoding by checking for BOM. If there is none - ANSI 
>> encoding is assumed...
> AFAIU this is ok for 16-bit based encodings, not for UTF-8.
> -Bertrand
Even though UTF-8 does not need a BOM to indicate endianness, Microsoft
Notepad began prepending a BOM to its UTF-8 text files. Strictly
speaking, it is U+FEFF serialized as UTF-8, which gives the bytes EF BB
BF (or in 4GL: CHR(15711167)). The BOM has some value as a file
signature, indicating that the plain text file is encoded as UTF-8
rather than in some other code page. That particular 3-byte sequence is
unlikely to represent real data in any other code page, given that the
text is supposed to be human readable in some language, although there
is a small possibility that it does represent a valid string in some
code page. Because Microsoft did it, and there is so much Notepad data
out there, the UTF-8 BOM became a de facto standard and then a de jure
standard (although the BOM remains optional).
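
To make the byte arithmetic and the detection behaviour above concrete, here is a minimal Python sketch. It is only an illustration, not how Notepad or any particular win32 tool is implemented: the function name `sniff_encoding` is hypothetical, and using `cp1252` as the "ANSI" fallback is an assumption (the real ANSI code page depends on the Windows locale).

```python
import codecs

# U+FEFF serialized as UTF-8 gives the three-byte signature EF BB BF.
BOM = "\ufeff".encode("utf-8")

# Read as one big-endian integer, those bytes are the 4GL CHR() argument.
BOM_AS_INT = int.from_bytes(BOM, "big")


def sniff_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Hypothetical BOM sniffer: use the BOM if present, else assume 'ANSI'.

    The fallback code page is an assumption; it varies by Windows locale.
    """
    if data.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" decodes UTF-8 and drops the leading BOM.
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        # The "utf-16" codec reads the BOM to pick the byte order.
        return "utf-16"
    return fallback


if __name__ == "__main__":
    text = "żółw"
    with_bom = BOM + text.encode("utf-8")
    without_bom = text.encode("utf-8")

    print(BOM.hex(), BOM_AS_INT)
    print(with_bom.decode(sniff_encoding(with_bom)))
    # Without a BOM the sniffer falls back to the code page and garbles the text.
    print(without_bom.decode(sniff_encoding(without_bom)))
```

The last call shows the failure mode discussed in this thread: valid UTF-8 without a BOM is treated as a single-byte code page and comes out as mojibake.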

M$ again.

Leszek Gawron                            
Project Manager                                    MobileBox sp. z o.o.
+48 (61) 855 06 67                    
mobile: +48 (501) 720 812                       fax: +48 (61) 853 29 65
