lucene-dev mailing list archives

From "Yonik Seeley" <>
Subject Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
Date Wed, 26 Mar 2008 21:31:16 GMT
On Wed, Mar 26, 2008 at 5:22 PM, Michael McCandless wrote:
>  > Are the string diffs and comparisons now performed against raw
>  > bytes, so that fewer conversions are needed?
>  Alas, not yet: Lucene still uses UTF-16 Java chars internally.  The
>  conversion to UTF-8 happens "at the last minute" when writing, and
>  "immediately" when reading.
>  I started exploring keeping UTF-8 bytes further in, but it quickly
>  got messy because it would require changing how the term infos are
>  sorted to be Unicode code point order.  Comparing bytes in UTF-8 is
>  the same as comparing Unicode code points, which is nice.  But
>  comparing UTF-16 values is almost but not quite the same.  So
>  suddenly everywhere a string comparison takes place I had to
>  assess whether that comparison should be by Unicode code point, and
>  call our own method for doing so.  It quickly became a "big" project
>  so I ran back to sorting by UTF-16 value.

Hmmm, can't we always do it by Unicode code point?
When do we need UTF-16 order?
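
For readers following the thread, here is a minimal sketch of where the two orders diverge. It is not Lucene code: the class name CodePointOrder and both helper methods are hypothetical, written only to illustrate the point quoted above. UTF-16 encodes supplementary characters (U+10000 and up) as surrogate pairs whose code units fall in 0xD800-0xDFFF, so they sort below BMP characters in 0xE000-0xFFFF under String.compareTo, while code point order and UTF-8 byte order both put them above. The sketch assumes well-formed strings (no unpaired surrogates) and uses StandardCharsets, which needs Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Lucene.
class CodePointOrder {

    // Compare two strings by Unicode code point instead of by UTF-16 code unit.
    static int compareByCodePoint(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the one with chars left is greater.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    // Compare the UTF-8 encodings byte by byte, treating bytes as unsigned.
    static int compareUtf8Bytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int k = 0; k < n; k++) {
            int d = (x[k] & 0xFF) - (y[k] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";        // U+FF5E, a BMP character above the surrogate range
        String supp = "\uD800\uDC00"; // U+10000, encoded as a surrogate pair

        // UTF-16 code unit order: 0xFF5E > 0xD800, so bmp sorts after supp.
        System.out.println(Integer.signum(bmp.compareTo(supp)));               // prints 1
        // Code point order: 0xFF5E < 0x10000, so bmp sorts before supp.
        System.out.println(Integer.signum(compareByCodePoint(bmp, supp)));     // prints -1
        // UTF-8 byte order agrees with code point order (EF BD 9E vs F0 90 80 80).
        System.out.println(Integer.signum(compareUtf8Bytes(bmp, supp)));       // prints -1
    }
}
```

For strings that contain no supplementary characters and nothing in U+E000-U+FFFF, all three comparisons agree, which is why the UTF-16 order is "almost but not quite the same".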

