lucene-dev mailing list archives

From Doug Cutting <>
Subject Re: Hacking Luke for bytecount-based strings
Date Wed, 17 May 2006 18:08:15 GMT
Marvin Humphrey wrote:
> What I'd like to do is augment my existing patch by making it possible 
> to specify a particular encoding, both for Lucene and Luke.

What ensures that all documents in fact use the same encoding?

The current approach of converting everything to Unicode and then 
writing UTF-8 to indexes makes indexes portable and simplifies the 
construction of search user interfaces, since only indexing code needs 
to know about other character sets and encodings.

If a collection has invalidly encoded text, how does it help to detect 
that later rather than sooner?

> Searches will continue to work regardless, because the patched 
> TermBuffer compares raw bytes. (A comparison based on Term.compareTo() 
> would likely fail, because raw bytes translated to UTF-8 may not 
> produce the same results.)

UTF-8 has the property that bytewise lexicographic order is the same as 
Unicode code point order.
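That property can be checked with a small standalone Java sketch (not Lucene code; the helper names compareBytes and compareCodePoints are illustrative). It also shows why a raw String.compareTo() is not equivalent: Java compares UTF-16 char values, which disagree with code point order once supplementary characters are involved.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Order {
    // Lexicographic comparison of byte arrays, treating bytes as unsigned.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    // Comparison by Unicode code point, the order UTF-8 bytes preserve.
    static int compareCodePoints(String s, String t) {
        int[] a = s.codePoints().toArray();
        int[] b = t.codePoints().toArray();
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) return Integer.compare(a[i], b[i]);
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        String bmp = "\uFFFD";                                 // U+FFFD (BMP)
        String supp = new String(Character.toChars(0x10400));  // U+10400 (supplementary)

        int byteOrder = compareBytes(bmp.getBytes(StandardCharsets.UTF_8),
                                     supp.getBytes(StandardCharsets.UTF_8));
        int cpOrder = compareCodePoints(bmp, supp);

        // UTF-8 byte order agrees with code point order: both say bmp < supp.
        System.out.println(Integer.signum(byteOrder) == Integer.signum(cpOrder)); // true

        // But String.compareTo() compares UTF-16 units: U+FFFD (0xFFFD) sorts
        // after the high surrogate 0xD801 that encodes U+10400, so the sign flips.
        System.out.println(Integer.signum(bmp.compareTo(supp))); // 1
    }
}
```

So any two valid UTF-8 byte sequences sort the same way whether compared byte-by-byte or decoded and compared by code point; it is the UTF-16-based comparison that can diverge.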

