lucene-dev mailing list archives

From Doug Cutting
Subject Re: storing term text internally as byte array and bytecount as prefix, etc.
Date Tue, 02 May 2006 16:16:08 GMT
Chuck Williams wrote:
> For lazy fields, there would be a substantial benefit to having the
> count on a String be an encoded byte count rather than a Java char
> count, but this has the same problem.  If there is a way to beat this
> problem, then I'd start arguing for a byte count.

I think the way to beat it is to keep things as bytes as long as 
possible.  For example, each term in a Query needs to be converted from 
String to byte[], but after that all search computation could happen 
comparing byte arrays.  (Note that lexicographic comparisons of UTF-8 
encoded bytes give the same results as lexicographic comparisons of 
Unicode character strings.)  And, when indexing, each Token would need 
to be converted from String to byte[] just once.
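
As a rough illustration of the ordering note above, here is a minimal Java sketch. The class and method names (Utf8Order, toUtf8, compareBytes, compareCodePoints) are hypothetical and not part of Lucene. It assumes well-formed strings (no unpaired surrogates), that bytes are compared as unsigned values, and that "Unicode character strings" means code point order. Java's String.compareTo compares UTF-16 code units instead, so it can disagree for characters outside the Basic Multilingual Plane.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Order {
    // Convert a term to UTF-8 once; later comparisons work on the bytes only.
    public static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Lexicographic comparison of byte arrays, treating each byte as unsigned.
    // Java bytes are signed, so mask with 0xFF before comparing.
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Reference ordering: lexicographic comparison by Unicode code point.
    public static int compareCodePoints(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other (or they are equal).
        return (a.length() - i) - (b.length() - j);
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"apple", "apply"},
            {"z", "\u00E9"},                 // ASCII vs. a two-byte UTF-8 sequence
            {"\uFFFD", "\uD83D\uDE00"},      // BMP character vs. U+1F600
        };
        for (String[] p : pairs) {
            int byBytes = Integer.signum(compareBytes(toUtf8(p[0]), toUtf8(p[1])));
            int byCodePoints = Integer.signum(compareCodePoints(p[0], p[1]));
            System.out.println(byBytes + " " + byCodePoints);
        }
    }
}
```

In this sketch the conversion from String to byte[] happens once per term, and everything after that compares bytes, which is the shape of the proposal above. The unsigned mask matters: a signed comparison would sort any multi-byte sequence before ASCII.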

The Java API can easily be made back-compatible.  The harder part would 
be making the file format back-compatible.

