lucene-java-user mailing list archives

From Marvin Humphrey
Subject Standard or Modified UTF-8?
Date Fri, 26 Aug 2005 23:51:27 GMT

As part of my attempt to speed up Plucene and establish index  
compatibility between Plucene and Java Lucene, I'm porting  
InputStream and OutputStream to XS (the C API for accessing Perl's  
guts), and I believe I have found a documentation bug in the  
file-format spec at...

"Lucene writes unicode character sequences using the standard UTF-8  

Snooping the code in OutputStream, it looks like you are writing  
modified UTF-8 -- NOT standard -- because the null character (U+0000)  
is written using the two-byte form (0xC0 0x80) rather than as a single  
0x00 byte.

    else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
      writeByte((byte)(0xC0 | (code >> 6)));
      writeByte((byte)(0x80 | (code & 0x3F)));
    }
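
To make the difference concrete, here is a minimal sketch that encodes the null character both ways. It does not use Lucene's OutputStream at all: it assumes the JDK's DataOutputStream.writeUTF as a stand-in for modified UTF-8 and String.getBytes for standard UTF-8. The class name Utf8Compare and its helper methods are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of Lucene.
public class Utf8Compare {
    // Standard UTF-8: U+0000 is encoded as the single byte 0x00.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as produced by DataOutputStream.writeUTF.
    // writeUTF prepends a two-byte length, which is stripped here
    // so that only the encoded characters remain.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Render bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String nul = "\0"; // the null character, U+0000
        System.out.println("standard: " + hex(standardUtf8(nul))); // 00
        System.out.println("modified: " + hex(modifiedUtf8(nul))); // C0 80
    }
}
```

The two-byte result 0xC0 0x80 is what the branch quoted above produces when code == 0, since 0xC0 | (0 >> 6) is 0xC0 and 0x80 | (0 & 0x3F) is 0x80. A strict standard UTF-8 decoder treats that sequence as an invalid overlong encoding.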

Can someone please confirm that the intention is to write modified  
UTF-8?

Marvin Humphrey
Rectangular Research
