lucene-dev mailing list archives

From Marvin Humphrey <>
Subject Re: Hacking Luke for bytecount-based strings
Date Wed, 17 May 2006 23:01:52 GMT

On May 17, 2006, at 2:04 PM, Doug Cutting wrote:

>> Detecting invalidly encoded text later doesn't help  anything in  
>> and of itself; lifting the requirement that everything be   
>> converted to Unicode early on opens up some options.
> How useful are those options?  Are they worth the price?   
> Converting to unicode early permits one to, e.g., write encoding- 
> independent tokenizers, stemmers, etc.  That seems like a lot to  
> throw away.

Fair enough.  For Java Lucene, the main benefits of encoding
flexibility would accrue when A) your material takes up a lot more
space in UTF-8 than in an alternative encoding, or B) you prefer a
native encoding to Unicode, most often because of the Han unification
issue.
The space issue could be addressed by allowing UTF-16 as an  
alternative.  Catering to arbitrary encodings doesn't offer that much  
benefit for the price, though your perspective on that may differ if  
you're, say, Japanese.
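
To make point A concrete, here's a quick sketch -- nothing
Lucene-specific, and the Japanese sample string is arbitrary -- showing
how the byte counts flip between ASCII-heavy and CJK-heavy terms:

    import java.io.UnsupportedEncodingException;

    public class EncodedSize {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String ascii = "lucene";
            // six Japanese characters ("search engine")
            String cjk = "\u691C\u7D22\u30A8\u30F3\u30B8\u30F3";

            System.out.println(ascii.getBytes("UTF-8").length);    // 6
            System.out.println(ascii.getBytes("UTF-16BE").length); // 12
            System.out.println(cjk.getBytes("UTF-8").length);      // 18
            System.out.println(cjk.getBytes("UTF-16BE").length);   // 12
        }
    }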

>>> UTF-8 has the property that bytewise lexicographic order is the   
>>> same as Unicode character order.
>> Yes.  I'm suggesting that an unpatched TermBuffer would have  
>> problems  with my index with corrupt character data because the  
>> sort order by  bytestring may not be the same as sort order by  
>> Unicode code point.
> I think you're saying that bytewise comparisons involving invalid  
> UTF-8 may differ from comparisons of the unicode code points they  
> represent. But if they're invalid, they don't actually represent  
> unicode code points, so how can they be compared?

Repairing an invalid Unicode sequence, whether UTF-8, UTF-16BE, or
another encoding form, generally means swapping in U+FFFD "REPLACEMENT
CHARACTER", provided that you don't throw a fatal error.  U+FFFD has a
code point value of its own, which affects sort order, and the swap
may also change the length of the sequence.
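
Here's roughly what that looks like with java.nio's CharsetDecoder set
to replace rather than report errors (the byte values are just an
arbitrary example of a malformed sequence):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    public class ReplacementDemo {
        public static void main(String[] args) throws Exception {
            // 0xE9 is a fine Latin-1 "e acute", but malformed in this UTF-8 context.
            byte[] corrupt = { 'a', (byte) 0xE9, 'b' };

            CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

            CharBuffer repaired = decoder.decode(ByteBuffer.wrap(corrupt));
            System.out.println((int) repaired.charAt(1));                     // 65533 (U+FFFD)
            System.out.println(repaired.toString().getBytes("UTF-8").length); // 5 bytes, vs. 3 originally
        }
    }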

More generally, if you map from valid data in another encoding to  
Unicode, lexical sorting of the source bytestring and lexical sorting  
of the Unicode target will often produce differing results.
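
A concrete case, assuming nothing beyond java.nio.charset: in
windows-1252 the byte 0x9A decodes to U+0161 and 0xE0 decodes to
U+00E0, so the same two single-character strings sort one way as raw
bytes and the other way as Unicode.

    import java.nio.charset.Charset;

    public class OrderFlip {
        public static void main(String[] args) {
            Charset cp1252 = Charset.forName("windows-1252");

            byte[] b1 = { (byte) 0x9A };   // "s with caron" -> U+0161
            byte[] b2 = { (byte) 0xE0 };   // "a grave"      -> U+00E0

            // Unsigned bytewise comparison: 0x9A sorts before 0xE0 ...
            System.out.println((b1[0] & 0xFF) - (b2[0] & 0xFF));  // negative

            // ... but by Unicode code point, U+0161 sorts after U+00E0.
            String s1 = new String(b1, cp1252);
            String s2 = new String(b2, cp1252);
            System.out.println(s1.compareTo(s2));                 // positive
        }
    }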

It really messes up a TermInfosReader to have terms out of sequence.
And unfortunately I misremembered how the cached Terms from the
auxiliary term dictionary get compared -- those use
term.compareTo(otherTerm) rather than
termBuffer.compareTo(otherTermBuffer).  The patched version of Lucene
doesn't change that, so if an invalidly encoded term with a
replacement character happens to fall on an index point, bad things
will happen.

That means the current patch is inadequate for dealing with  
KinoSearch 0.05 or Ferret indexes unless the application developer  
forced UTF-8 at index-time.  I'd need to make additional changes in  
order to guarantee that a patched Luke would work -- TermInfosReader  
would need to cache the bytestrings and compare those instead.   
That's effectively what KinoSearch does.
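
Something along these lines -- not actual Lucene code, just a sketch of
the unsigned, bytewise comparison a bytestring-caching TermInfosReader
would use in place of Term.compareTo:

    // Hypothetical helper: compare two cached term bytestrings directly,
    // unsigned and lexicographically, without decoding them to Terms.
    static int compareTermBytes(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }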

That's probably a good idea anyway, as it cuts down the RAM  
requirements for caching the Term Infos Index -- so long as your data  
occupies less space as a bytestring than as Java chars.

Marvin Humphrey
Rectangular Research
