lucene-dev mailing list archives

From: Marvin Humphrey
Subject: Re: bytecount as prefix
Date: Sat, 06 May 2006 16:31:33 GMT
On Sat, May 06, 2006 at 05:11:02PM +0900, David Balmain wrote:
> Hi Marvin,
> Where are you with this? I also have a vested interest in seeing
> Lucene move to using byte counts. I was wondering if I could help out.
> Is the patch you pasted here the latest you have?

All I've added since then is debugging code.  Including some last night.

As I mentioned in another thread, this is going to be a multi-stage
process.  The goal of that first patch is to have Lucene using
bytecounts everywhere (except for TermVectors, just because it isn't
strictly necessary there).  Lucene will be slower once the patch is
[fixed, completed and] applied.
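
As a rough illustration of what a bytecount prefix buys, here is a minimal Java sketch. It is not the patch under discussion and not Lucene's index format: the class name, the method names, and the fixed 4-byte length prefix are all assumptions made for the example. Because the prefix counts bytes, a reader can step over a string without decoding it. With a character-count prefix and UTF-8 data, the reader has to decode every character to find where the string ends.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: strings stored as [byte count][UTF-8 bytes].
public class ByteCountPrefixSketch {

    // Write the UTF-8 byte count first, then the bytes themselves.
    // (A fixed 4-byte int prefix is a simplification for this example.)
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);
        out.write(utf8);
    }

    // Read the byte count, pull exactly that many bytes, then decode.
    static String readString(DataInputStream in) throws IOException {
        int byteCount = in.readInt();
        byte[] utf8 = new byte[byteCount];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Step over a string without decoding any characters.
    static void skipString(DataInputStream in) throws IOException {
        int byteCount = in.readInt();
        int skipped = 0;
        while (skipped < byteCount) {
            int n = in.skipBytes(byteCount - skipped);
            if (n <= 0) {
                throw new EOFException("stream ended inside a string");
            }
            skipped += n;
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        writeString(out, "na\u00efve");   // 5 chars, 6 bytes in UTF-8
        writeString(out, "lucene");

        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        skipString(in);                   // no decoding needed for the first string
        System.out.println(readString(in));
    }
}
```

The skip path is the part that matters for things like lazy loading: the cost of passing over a value no longer depends on decoding its contents.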

The next stage will involve finding optimizations to return Lucene to at
least its prior speed.  The primary target is segment merger.

Looking ahead, it will be interesting to see how many advantages of
working with term text as bytestrings can be realized.  Lazy loading of
fields should be an obvious winner.  The cached .tii in TermInfosReader
could potentially occupy a lot less RAM if your text takes up less space
in UTF-8 than as 16-bit chars.  And it becomes theoretically possible to have
Lucene use an arbitrary encoding for character data in the index, rather
than only UTF-8.
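
To make the RAM point above concrete, the following Java sketch compares the raw size of the same text held as 16-bit chars and as UTF-8 bytes. The class and method names are hypothetical, and the figures count only character data, ignoring object headers and any other per-string overhead in the JVM. ASCII-heavy text comes out at half the size in UTF-8, while some scripts (the CJK sample here) come out larger, which is why the saving depends on your text.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: raw character-data size, chars versus UTF-8 bytes.
public class TermStorageSizeSketch {

    // Each Java char is a 16-bit code unit, so 2 bytes per char.
    static int sizeAsChars(String s) {
        return s.length() * 2;
    }

    // UTF-8 uses 1 to 4 bytes per code point depending on the character.
    static int sizeAsUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "lucene",                // ASCII only
            "na\u00efve",            // one 2-byte character in UTF-8
            "\u65e5\u672c\u8a9e"     // three CJK characters, 3 bytes each in UTF-8
        };
        for (String s : samples) {
            System.out.println(s.length() + " chars: "
                + sizeAsChars(s) + " bytes as chars, "
                + sizeAsUtf8(s) + " bytes as UTF-8");
        }
    }
}
```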

The intended mechanics of that patch should be plain enough.  I'm going
to take another crack at seeing what's wrong with it today.  If somebody
beats me to a solution, I won't complain. :)

Marvin Humphrey
Rectangular Research
