lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: Realtime Search
Date Wed, 24 Dec 2008 19:03:00 GMT
On Tue, Dec 23, 2008 at 11:02:56PM -0600, robert engels wrote:
> Seems doubtful you will be able to do this without increasing the  
> index size dramatically. Since it will need to be stored  
> "unpacked" (in order to have random access), yet the terms are  
> variable length - leading to using a maximum=minimum size for every  
> term.

Wow.  That's a spectacularly awful design.  Its worst case -- one outlier
term, say, 1000 characters in length, in a field where the average term length
is in the single digits -- would explode the index size and incur wasteful IO
overhead, just as you say.

Good thing we've never considered it.  :)

I'm hoping we can improve on this, but for now, we've ended up at a two-file
design for the term dictionary index.

  1) Stacked 64-bit file pointers.
  2) Variable length character and term info data, interpreted using a 
     pluggable codec.
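
To make the two-part layout concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the actual file format: the function names, the big-endian byte order, and the trailing sentinel pointer are all hypothetical, and the codec-interpreted term info is left out so that only the term text is stored.

```python
import struct


def build_term_index(terms):
    """Return (pointers, data) for a collection of terms.

    pointers: one unsigned 64-bit offset per term, plus a final sentinel
              offset marking the end of the data (an assumption of this sketch).
    data:     the UTF-8 bytes of each term, stored back to back.
    """
    encoded = sorted(term.encode("utf-8") for term in terms)
    offsets = [0]
    for raw in encoded:
        offsets.append(offsets[-1] + len(raw))
    pointers = struct.pack(">%dQ" % len(offsets), *offsets)
    return pointers, b"".join(encoded)


def term_at(pointers, data, i):
    """Random access: two adjacent pointers delimit entry i."""
    start, end = struct.unpack_from(">2Q", pointers, i * 8)
    return data[start:end]


def find_floor(pointers, data, target):
    """Binary search for the last entry <= target; returns -1 if none."""
    key = target.encode("utf-8")
    lo, hi = 0, len(pointers) // 8 - 1  # hi starts at the number of terms
    while lo < hi:
        mid = (lo + hi) // 2
        if term_at(pointers, data, mid) <= key:
            lo = mid + 1
        else:
            hi = mid
    return lo - 1
```

Each entry costs a fixed 8 bytes of pointer plus its own length in bytes, so a single 1000-character outlier adds roughly 1000 bytes in total rather than padding every entry to that width.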

In the index at least, each entry would contain the full term text, encoded as
UTF-8.  Probably the primary term dictionary would continue to use string

That design offers no significant benefits other than those that flow from
compatibility with mmap: faster IndexReader open/reopen, lower RAM usage
under multiple processes by way of buffer sharing.  IO bandwidth requirements
and speed are probably a little better, but lookups on the term dictionary
index are not a significant search-time bottleneck.
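
As a rough illustration of the mmap point, the sketch below writes a tiny pointer file and a data file, maps both read-only, and reads one entry straight from the mappings without parsing anything at open time. The file names, the byte order, and the `demo` helper are assumptions made for the example only.

```python
import mmap
import os
import struct
import tempfile


def write_file(path, payload):
    with open(path, "wb") as f:
        f.write(payload)


def open_mapped(path):
    """Map a file read-only. Pages are loaded lazily by the OS and can be
    shared between processes that map the same file."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def demo():
    terms = [b"apple", b"banana", b"cherry"]
    offsets = [0]
    for raw in terms:
        offsets.append(offsets[-1] + len(raw))

    with tempfile.TemporaryDirectory() as tmp:
        ptr_path = os.path.join(tmp, "terms.ptr")  # hypothetical file name
        dat_path = os.path.join(tmp, "terms.dat")  # hypothetical file name
        write_file(ptr_path, struct.pack(">%dQ" % len(offsets), *offsets))
        write_file(dat_path, b"".join(terms))

        ptrs = open_mapped(ptr_path)
        dat = open_mapped(dat_path)
        try:
            # "Opening" the index is just the two mmap calls above; entry 1
            # is read directly from the mapped bytes.
            start, end = struct.unpack_from(">2Q", ptrs, 1 * 8)
            return bytes(dat[start:end])
        finally:
            ptrs.close()
            dat.close()
```

Because nothing is decoded into heap objects up front, the cost of opening is independent of the number of entries, which is where the faster open/reopen would come from.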

Additionally, sort caches would be written at index time in three files, and
memory mapped as laid out in 

  1) Stacked 64-bit file pointers.
  2) Character data.
  3) Doc num to ord mapping.
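
A sketch of how those three parts could fit together, again in Python and again hypothetical: the function names, the 32-bit width of each ord, and the big-endian byte order are assumptions, not details taken from the design.

```python
import struct


def build_sort_cache(doc_values):
    """Build (pointers, chars, ords) from one field value per document.

    pointers: 64-bit offsets into chars, one per unique value plus a sentinel.
    chars:    UTF-8 bytes of the unique values in sorted order.
    ords:     one 32-bit ordinal per document, indexed by doc num.
    """
    encoded = [value.encode("utf-8") for value in doc_values]
    unique = sorted(set(encoded))
    ord_of = {raw: i for i, raw in enumerate(unique)}

    offsets = [0]
    for raw in unique:
        offsets.append(offsets[-1] + len(raw))

    pointers = struct.pack(">%dQ" % len(offsets), *offsets)
    chars = b"".join(unique)
    ords = struct.pack(">%dI" % len(encoded), *(ord_of[raw] for raw in encoded))
    return pointers, chars, ords


def ord_for_doc(ords, doc_num):
    return struct.unpack_from(">I", ords, doc_num * 4)[0]


def value_for_ord(pointers, chars, ordinal):
    start, end = struct.unpack_from(">2Q", pointers, ordinal * 8)
    return chars[start:end].decode("utf-8")


def sort_docs(ords, doc_nums):
    """Sorting compares integer ords only; no string data is touched."""
    return sorted(doc_nums, key=lambda doc: ord_for_doc(ords, doc))
```

Sorting hits then needs only the doc-num-to-ord array; the pointers and character data are consulted when the actual value for an ord has to be recovered.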

Marvin Humphrey
