lucene-dev mailing list archives

From Yonik Seeley <ysee...@gmail.com>
Subject Re: Lucene does NOT use UTF-8.
Date Tue, 30 Aug 2005 18:15:13 GMT
I've been looking around... do you have a pointer to the source where just 
the suffix is converted from UTF-8?

I understand the index format, but I'm not sure I understand the problem 
that would be posed by the prefix length being a byte count.
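
To make the question concrete, here is a minimal sketch of how a shared
prefix length differs when counted in Java chars versus bytes. It assumes
standard UTF-8 via the JDK (not Lucene's own encoder), and the class and
method names are hypothetical, not taken from the Lucene source.

```java
import java.nio.charset.StandardCharsets;

public class PrefixLengthDemo {
    // Shared prefix length counted in Java chars (UTF-16 code units).
    static int sharedPrefixChars(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // Shared prefix length counted in bytes of standard UTF-8.
    // Assumption: the JDK's UTF-8 encoder stands in for the index encoding.
    static int sharedPrefixBytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        int i = 0;
        while (i < n && x[i] == y[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String prev = "caf\u00e9s";           // "cafés"
        String next = "caf\u00e9t\u00e9ria";  // "cafétéria"
        System.out.println("chars: " + sharedPrefixChars(prev, next));
        System.out.println("bytes: " + sharedPrefixBytes(prev, next));
    }
}
```

The two counts diverge as soon as the shared prefix contains a non-ASCII
character. A byte-counted prefix can also end in the middle of a multi-byte
character (compare "caf\u00e9" with "caf\u00e8", which share the lead byte
of their last character), so the suffix bytes alone would not always decode
on their own.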

-Yonik
Now hiring -- http://tinyurl.com/7m67g

On 8/30/05, Doug Cutting <cutting@apache.org> wrote:
> 
> tjones@apache.org wrote:
> > How will the difference impact String memory allocations? Looking at
> > the String code, I can't see where it would make an impact.
> 
> I spoke a bit too soon. I should have looked at the code first. You're
> right, I don't think it would require more allocations.
> 
> When considering this byte-count versus character-count issue please
> note that it also arises elsewhere. The PrefixLength in the Term
> Dictionary section of the file format document is currently defined as a
> number of characters, not bytes.
> 
> http://lucene.apache.org/java/docs/fileformats.html#Term Dictionary
> 
> Implementing this in terms of bytes may have performance implications,
> since, at first glance, the entire byte sequence would need to be
> converted from UTF-8 into the internal string representation for each
> term, rather than just the suffix. Does anyone see a way around that?
> 
> As for how we got to this point: I wrote Lucene's UTF-8 reading and
> writing code in 1998, back when Unicode still had fewer than 2^16
> characters. It's surprising that it has lasted this long without anyone
> noticing!
> 
> Doug
