lucene-dev mailing list archives

From Ronald Dauster <>
Subject Re: Lucene does NOT use UTF-8
Date Mon, 29 Aug 2005 10:02:13 GMT
Erik Hatcher wrote:

> On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:
>>> I'm not familiar with UTF-8 enough to follow the details of this
>>> discussion.  I hope other Lucene developers are, so we can resolve  
>>> this
>>> issue.... anyone raising a hand?
>> I could, but recent posts makes me think this is heading towards a  
>> religious debate :)
> Ken - you mentioned taking the discussion off-line in a previous  
> post.  Please don't.  Let's keep it alive on java-dev until we have a  
> resolution to it.
I'd also like to follow this thread.

>> I think the following statements are all true:
>> a. Using UTF-8 for strings would make it easier for Lucene indexes  
>> to be used by other implementations besides the reference Java  version.
>> b. It would be easy to tweak Lucene to read/write conformant UTF-8  
>> strings.
> What, if any, performance impact would changing Java Lucene in this  
> regard have?   (I realize this is rhetorical at this point, until a  
> solution is at hand)
Looking at the source of 1.4.3, fixing the NUL character encoding is 
trivial for writing, and reading already works for both the standard and 
the Java-style encoding. Not much work and absolutely no performance 
impact here.
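
To make the NUL difference concrete, here is a minimal sketch. It uses the JDK's DataOutputStream.writeUTF as a stand-in for the Java-style encoding; this is an assumption for illustration, not Lucene's own writeChars code, and the class name NulEncodingDemo is made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NulEncodingDemo {
    /** Standard UTF-8: U+0000 is written as the single byte 0x00. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Java "modified UTF-8" as produced by writeUTF, without its 2-byte length prefix. */
    static byte[] javaModifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length); // drop the length prefix
    }

    public static void main(String[] args) {
        // One byte in standard UTF-8, two bytes (0xC0 0x80) in the Java-style form.
        System.out.println(standardUtf8("\u0000").length);
        System.out.println(javaModifiedUtf8("\u0000").length);
    }
}
```

A writer that wants conformant output only has to emit the single 0x00 byte; a reader that decodes two-byte sequences generically will still accept 0xC0 0x80 from older data.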

The surrogate pair problem is another matter entirely. First of all, 
let's see if I understand the problem correctly: some Unicode 
characters can be represented by one codepoint outside the BMP (i.e., 
not with 16 bits) and alternatively with two codepoints, both of them in 
the 16-bit range. According to Marvin's explanations, the Unicode 
standard requires these characters to be represented as "the one" 
codepoint in UTF-8, resulting in a 4-, 5-, or 6-byte encoding for that 
character.

But since a Java char _is_ 16 bit, the codepoints beyond the 16-bit 
range cannot be represented as chars.  That is, the 
in-memory-representation still requires the use of the surrogate pairs.  
Therefore, writing consists of translating the surrogate pair to the 
 >16bit representation of the same character and then algorithmically 
encoding that.  Reading is exactly the reverse process.

Adding code to handle the 4 to 6 byte encodings to the 
readChars/writeChars method is simple, but how do you do the mapping 
from surrogate pairs to the chars they represent? Is there an algorithm 
for doing that except for table lookups or huge switch statements?
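
For what it's worth, the mapping is plain arithmetic, so no tables are needed. The sketch below shows it under stated assumptions: the class and method names are hypothetical, and this is not code from Lucene. Since Java 5 the JDK also offers Character.toCodePoint and Character.toChars for the same conversion. Note that Unicode ends at U+10FFFF, so a supplementary character always comes out as exactly 4 bytes in UTF-8; the 5- and 6-byte forms are never required.

```java
public class SurrogateDemo {
    /** Combine a high (0xD800-0xDBFF) and a low (0xDC00-0xDFFF) surrogate into one code point. */
    static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    /** Split a supplementary code point (0x10000-0x10FFFF) back into its surrogate pair. */
    static char[] toSurrogates(int codePoint) {
        int v = codePoint - 0x10000;
        return new char[] { (char) (0xD800 + (v >> 10)), (char) (0xDC00 + (v & 0x3FF)) };
    }

    /** Encode a supplementary code point as its 4-byte UTF-8 sequence (no range checks). */
    static byte[] encodeUtf8(int cp) {
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) is the pair D834 DD1E in a Java String.
        int cp = toCodePoint('\uD834', '\uDD1E');
        System.out.println(Integer.toHexString(cp));
        for (byte b : encodeUtf8(cp)) {
            System.out.print(Integer.toHexString(b & 0xFF) + " ");
        }
        System.out.println();
    }
}
```

Writing would call toCodePoint whenever a high surrogate is followed by a low one and then emit the four bytes; reading a 4-byte sequence would rebuild the code point and call toSurrogates to get the two chars back.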

>> c. The hard(er) part would be backwards compatibility with older  
>> indexes. I haven't looked at this enough to really know, but one  
>> example is the compound file (xx.cfs) format...I didn't see a  
>> version number, and it contains strings.
> I don't know the gory details, but we've made compatibility breaking  
> changes in the past and the current version of Lucene can open older  
> formats, but only write the most current format.  I suspect it could  
> be made to be backwards compatible.  Worst case, we break  
> compatibility in 2.0.
I believe backward compatibility is the easy part and comes for free.  
As I mentioned above, reading the "correct" NUL encoding already works 
and the non-BMP characters will have to be represented as surrogate 
pairs internally anyway.  So there is no problem with reading the old 
encoding and there is nothing wrong with still using or reading the 
surrogate pairs, only that they would not be written. Even indices with 
mixed segments are not a problem. 
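
A rough sketch of why reading stays compatible, assuming a decoder that handles 1- to 4-byte sequences generically. The CompatDecoder class is hypothetical, does no validation, and is not Lucene's readChars. Old-style data (each surrogate written as its own 3-byte sequence, NUL written as 0xC0 0x80) and conformant data (one 4-byte sequence, NUL as 0x00) decode to the same in-memory chars.

```java
public class CompatDecoder {
    /** Decode 1- to 4-byte sequences into Java chars; assumes well-formed input. */
    static String decode(byte[] in) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < in.length) {
            int b = in[i++] & 0xFF;
            if (b < 0x80) {
                // Single byte, including a conformant 0x00 for NUL.
                sb.append((char) b);
            } else if ((b & 0xE0) == 0xC0) {
                // Two bytes; the Java-style NUL (0xC0 0x80) decodes to U+0000 here.
                sb.append((char) (((b & 0x1F) << 6) | (in[i++] & 0x3F)));
            } else if ((b & 0xF0) == 0xE0) {
                // Three bytes; an old-style encoded surrogate passes through as a char.
                sb.append((char) (((b & 0x0F) << 12)
                        | ((in[i++] & 0x3F) << 6)
                        | (in[i++] & 0x3F)));
            } else {
                // Four bytes; appendCodePoint stores the result as a surrogate pair.
                int cp = ((b & 0x07) << 18)
                        | ((in[i++] & 0x3F) << 12)
                        | ((in[i++] & 0x3F) << 6)
                        | (in[i++] & 0x3F);
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }
}
```

Because both forms end up as the same surrogate pair in the String, segments written before and after the change can sit side by side in one index.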

Given that the CompoundFileReader/Writer use a for their FileEntries, they would 
also be able to read older files but potentially write incompatible 
files.  OTOH, when used inside Lucene, the filenames do not contain NULs 
or non-BMP chars.

But: Is the compound file format supposed to be "interoperable"? Which 
formats are?

> [...]
>> What's unclear to me (not being a Perl, Python, etc jock) is how  
>> much easier it would be to get these other implementations working  
>> with Lucene, following a change to UTF-8. So I can't comment on the  
>> return on time required to change things.
>> [...]
> PyLucene is literally the Java version of Lucene underneath (via GCJ/ 
> SWIG), so no worries there.  CLucene would need to be changed, as  
> well as DotLucene and the other ports out there.
> If the rest of the world of Lucene ports followed suit with PyLucene  
> and did the GCJ/SWIG thing, we'd have no problems :)  What are the  
> disadvantages to following this model with Plucene?
Some parts of the Lucene API require subclassing (e.g., Analyzer), and 
SWIG supports cross-language polymorphism only for a few languages, 
notably Python and Java but not Perl. Noticing the smiley, I won't 
mention the zillion other reasons not to use the "GCJ/SWIG thing".

