lucene-dev mailing list archives

From "jian chen" <chenjian1...@gmail.com>
Subject Re: storing term text internally as byte array and bytecount as prefix, etc.
Date Tue, 02 May 2006 18:15:27 GMT
Hi, Doug,

I totally agree with what you said. I think it is more of a file
format issue than an API issue. It seems that we just need to add an
extra constructor to Term.java that takes a UTF-8 byte array.
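
To make the idea concrete, here is a rough sketch of what I have in mind.
Utf8Term is a hypothetical stand-in, not the real Term class, and the
field and method names are only assumptions for illustration. It converts
String to byte[] once in the constructor and then compares terms as
unsigned bytes:

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for Term; NOT Lucene's actual class or API. */
final class Utf8Term implements Comparable<Utf8Term> {
    private final String field;
    private final byte[] utf8; // term text kept as UTF-8 bytes

    /** Existing-style constructor: converts String to bytes exactly once. */
    Utf8Term(String field, String text) {
        this(field, text.getBytes(StandardCharsets.UTF_8));
    }

    /** The proposed extra constructor: takes the UTF-8 bytes directly. */
    Utf8Term(String field, byte[] utf8Bytes) {
        this.field = field;
        this.utf8 = utf8Bytes.clone(); // defensive copy
    }

    /** Decodes back to a String only when a caller actually needs one. */
    String text() {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** Orders by field name, then by term bytes compared as unsigned values. */
    @Override
    public int compareTo(Utf8Term other) {
        int c = field.compareTo(other.field);
        if (c != 0) {
            return c;
        }
        int n = Math.min(utf8.length, other.utf8.length);
        for (int i = 0; i < n; i++) {
            // Java bytes are signed, so mask to compare them as 0..255.
            int a = utf8[i] & 0xFF;
            int b = other.utf8[i] & 0xFF;
            if (a != b) {
                return a - b;
            }
        }
        return utf8.length - other.utf8.length;
    }
}
```

The unsigned masking matters: without it, any non-ASCII term would sort
before the ASCII ones. Also note that byte order matches Unicode code
point order, which can differ from String.compareTo for characters
outside the BMP, since Java compares UTF-16 code units.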

Lucene 2.0 is going to break backward compatibility anyway, right? So
maybe this change to standard UTF-8 could be a hot item on the Lucene 2.0 list?

Cheers,

Jian Chen

On 5/2/06, Doug Cutting <cutting@apache.org> wrote:
>
> Chuck Williams wrote:
> > For lazy fields, there would be a substantial benefit to having the
> > count on a String be an encoded byte count rather than a Java char
> > count, but this has the same problem.  If there is a way to beat this
> > problem, then I'd start arguing for a byte count.
>
> I think the way to beat it is to keep things as bytes as long as
> possible.  For example, each term in a Query needs to be converted from
> String to byte[], but after that all search computation could happen
> comparing byte arrays.  (Note that lexicographic comparisons of UTF-8
> encoded bytes give the same results as lexicographic comparisons of
> Unicode character strings.)  And, when indexing, each Token would need
> to be converted from String to byte[] just once.
>
> The Java API can easily be made back-compatible.  The harder part would
> be making the file format back-compatible.
>
> Doug
