lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: binary at the front of CHANGES.txt
Date Wed, 18 Jul 2007 11:16:33 GMT

On Jul 17, 2007, at 8:40 PM, Yonik Seeley wrote:

> On 7/17/07, DM Smith <dmsmith555@gmail.com> wrote:
>> According to the UTF-8 spec \uFEFF is not a BOM. In UTF-8 the byte
>> order is always the same.
>
> But there is a BOM for UTF-8 (even though there is no endian
> component, it does serve as a marker indicating the text file is
> unicode text encoded in UTF-8).
>
> http://unicode.org/faq/utf_bom.html#29

This is all rather academic at this point as you have fixed the problem.

I stand corrected: \uFEFF (the code point) is the BOM for all UTF
encodings, with its byte representation differing by encoding. But
UTF-8 byte order is always the same, regardless of the presence of
the BOM.

According to the Unicode 5.0 Standard book, Chapter 13, Section 13.6,
the byte sequence of the BOM for UTF-8 is EF BB BF (3 bytes) and for
UTF-16 it is FE FF or FF FE (2 bytes). It appears that the byte
sequence is unique for each Unicode representation.

See http://www.unicode.org/unicode/uni2book/ch13.pdf#BOM
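
As a minimal sketch of the point above, the following Python snippet
encodes the single code point U+FEFF in several encodings and prints the
resulting bytes. The helper name bom_bytes is made up for illustration,
and the snippet assumes Python 3.8 or later (for bytes.hex with a
separator).

```python
# Hypothetical helper (not from Lucene or the Unicode Standard): shows how
# the one code point U+FEFF serializes to different bytes per encoding.
BOM = "\ufeff"


def bom_bytes(encoding: str) -> str:
    """Return the bytes of U+FEFF in `encoding` as spaced uppercase hex."""
    return BOM.encode(encoding).hex(" ").upper()


if __name__ == "__main__":
    # The -be/-le codec variants are used so that Python does not prepend
    # a BOM of its own, as the plain "utf-16" codec would.
    for name in ("utf-8", "utf-16-be", "utf-16-le"):
        print(f"{name}: {bom_bytes(name)}")
```

The same code point comes out as three bytes in UTF-8 and as two bytes,
in either order, in UTF-16.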

I frequently see FE FF at the beginning of UTF-8 files, and I have
only seen MS editors add it. This is wrong for UTF-8 files. I was
assuming that this was the junk at the beginning of the file.

But the junk at the beginning of the file was C2 BF. I am not at all
sure what this would be.
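
For what it is worth, C2 BF is a well-formed UTF-8 sequence: it decodes
to U+00BF (INVERTED QUESTION MARK), which is not a BOM in any encoding
form. The sketch below checks leading bytes against the BOM constants in
Python's codecs module and, failing that, names the first decoded
character. The function describe_prefix is a hypothetical name, and the
snippet says nothing about how those bytes ended up in CHANGES.txt.

```python
import codecs
import unicodedata

# BOM byte sequences from the standard codecs module. The UTF-32 entries
# come before UTF-16 because FF FE 00 00 also starts with FF FE.
_BOMS = [
    ("UTF-8", codecs.BOM_UTF8),
    ("UTF-32 LE", codecs.BOM_UTF32_LE),
    ("UTF-32 BE", codecs.BOM_UTF32_BE),
    ("UTF-16 LE", codecs.BOM_UTF16_LE),
    ("UTF-16 BE", codecs.BOM_UTF16_BE),
]


def describe_prefix(data: bytes) -> str:
    """Hypothetical helper: say what the leading bytes of `data` look like."""
    for label, bom in _BOMS:
        if data.startswith(bom):
            return f"{label} BOM"
    try:
        # Assumes `data` is not cut off in the middle of a character.
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "not a BOM and not valid UTF-8"
    if not text:
        return "empty"
    first = text[0]
    return f"U+{ord(first):04X} {unicodedata.name(first, 'UNNAMED')}"


if __name__ == "__main__":
    print(describe_prefix(b"\xc2\xbf"))      # the bytes reported above
    print(describe_prefix(b"\xef\xbb\xbf"))  # the UTF-8 BOM, for comparison
```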





