lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: Lucene does NOT use UTF-8
Date Mon, 29 Aug 2005 19:16:50 GMT

Erik Hatcher wrote...

> What, if any, performance impact would changing Java Lucene in this  
> regard have?

And Ken Krugler wrote...

> "Lucene writes strings as a VInt representing the length of the  
> string in Java chars (UTF-16 code units), followed by the character  
> data."

I had been working under the assumption that the value of the VInt  
would be changed as well.  It seemed logical that if strings were  
encoded with legal UTF-8, the count at the head should indicate  
either 1) the number of Unicode characters (code points) in the  
string, or 2) the number of bytes occupied by the encoded string.
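
To make the three candidate counts concrete, here is a minimal Java sketch. The class and method names are hypothetical and not taken from Lucene; it uses only standard JDK calls to show how the counts diverge for one string.

```java
import java.nio.charset.StandardCharsets;

public class StringLengths {
    // Number of Java chars (UTF-16 code units): per the description quoted
    // above, this is what the VInt prefix currently records.
    static int utf16Units(String s) {
        return s.length();
    }

    // Option 1: number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Option 2: number of bytes the string occupies as standard UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a', e with acute accent (U+00E9), and U+1D11E (musical G clef),
        // the last one stored in Java as a surrogate pair.
        String s = "a\u00E9\uD834\uDD1E";
        System.out.println(utf16Units(s)); // 4
        System.out.println(codePoints(s)); // 3
        System.out.println(utf8Bytes(s));  // 7
    }
}
```

The three numbers agree only for pure ASCII text, so a reader has to know which one the prefix holds before it can decode the string.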

Either of those would require more substantial changes to Java  
Lucene.  I expect that the impact on performance could be made  
negligible for the first option, but the question of backwards  
compatibility would become a lot messier.

It simply had not occurred to me to keep the VInt as is.  If you do  
that, this becomes a much more localized problem.

For Plucene, I'll avoid the gory details and just say that having the  
VInt continue to represent UTF-16 code units limits the availability  
of certain options, but doesn't cause major inefficiencies.  Now that  
we know that's what it does, we can work with it.  A transition to  
always-legal UTF-8 obviates the need to scan for and fix the edge  
cases, and addresses my main concern.
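
As an illustration of the kind of edge cases in question, the sketch below compares Java's modified UTF-8 (as written by DataOutputStream.writeUTF) with standard UTF-8. That these are the cases meant is an assumption; the code is not Lucene's own writer, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8EdgeCases {
    // Bytes produced by DataOutputStream.writeUTF (modified UTF-8),
    // with the 2-byte length header that writeUTF prepends removed.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Bytes produced by the JDK's standard UTF-8 encoder.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String nul = "\u0000";        // U+0000
        String clef = "\uD834\uDD1E"; // U+1D11E as a surrogate pair

        // Modified UTF-8 writes U+0000 as the two bytes C0 80.
        System.out.println(modifiedUtf8(nul).length + " vs " + standardUtf8(nul).length);   // 2 vs 1
        // Modified UTF-8 encodes each surrogate separately, 3 bytes apiece.
        System.out.println(modifiedUtf8(clef).length + " vs " + standardUtf8(clef).length); // 6 vs 4
    }
}
```

A strict UTF-8 decoder rejects both modified forms, which is why a reader in another language would otherwise have to scan for them and repair them.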

Marvin Humphrey
Rectangular Research
