lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: caching term information?
Date Thu, 18 May 2006 19:56:08 GMT

On May 18, 2006, at 10:43 AM, Robert Engels wrote:

> Has anyone thought of (or implemented) caching of term information?
> Currently, Lucene stores an index of every nTH term. Then uses this
> information to position the TermEnum, and then scans the terms.
> Might it be better to read a "page" of term infos (based on the  
> index), and
> then keep these pages in a SoftCache in the SegmentTermEnum ?
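For reference, the lookup described in the quoted text can be sketched roughly as follows. This is only an illustration in Python, not Lucene's code: the names build_sparse_index and lookup and the tiny interval of 4 are made up for the example, and the terms are plain strings held in a list rather than entries read from disk.

```python
from bisect import bisect_right

INDEX_INTERVAL = 4  # hypothetical value, chosen small so the example is easy to follow


def build_sparse_index(sorted_terms, interval=INDEX_INTERVAL):
    """Keep every interval-th term together with its position in the full list."""
    return [(sorted_terms[i], i) for i in range(0, len(sorted_terms), interval)]


def lookup(sorted_terms, sparse_index, target, interval=INDEX_INTERVAL):
    """Return the position of target in sorted_terms, or -1 if it is absent."""
    keys = [term for term, _ in sparse_index]
    # Find the last indexed term that is <= target.
    slot = bisect_right(keys, target) - 1
    if slot < 0:
        return -1  # target sorts before the first term
    start = sparse_index[slot][1]
    # Scan forward through at most one interval's worth of terms.
    for pos in range(start, min(start + interval, len(sorted_terms))):
        if sorted_terms[pos] == target:
            return pos
        if sorted_terms[pos] > target:
            break
    return -1
```

The scan after the index lookup is the part a page cache or a fully loaded dictionary would be trying to avoid.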

I'd thought about just making it possible to load up the whole Term  
Dictionary.  Dangerous for large indexes, but interesting.  The 1998  
Google paper (Brin and Page) indicates that they got their whole  
dictionary into RAM.

The thing about caching pages of the dictionary is that I don't think  
that heavily searched terms will be concentrated in one page, so it  
would probably get swapped a lot.  I'm not familiar with SoftCache,  

KinoSearch currently caches SegmentTermEnum entries as bytestrings,  
or more accurately "ByteBuf" C structs modeled on Java's ByteBuffer  
which are basically an array of char, a length, and a capacity.  Each  
bytestring consists of the field number as a big endian 16-bit int,  
followed by the term text.  Since field numbers in KinoSearch are  
forced to correspond to lexically sorted field names, those sort  

The ByteBufs don't take up a lot of space, and they could be even  
smaller if they used VInts for field number.  If we load everything  
up, then locating a term in the .tis file can be achieved with a  
binary search.  Pay RAM to buy speed.
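
A minimal sketch of that idea, in Python rather than the C that KinoSearch uses: each key is the field number as a big-endian 16-bit int followed by the term text, and the sorted keys are searched with bisect. The function names, the use of UTF-8 for the term text, and the tis_offset values are assumptions for illustration, not KinoSearch's actual API or file layout.

```python
import struct
from bisect import bisect_left


def encode_key(field_num, term_text):
    """Big-endian 16-bit field number followed by the term text (UTF-8 assumed)."""
    return struct.pack(">H", field_num) + term_text.encode("utf-8")


def build_dictionary(entries):
    """entries: iterable of (field_num, term_text, tis_offset) tuples.

    tis_offset is a hypothetical file position for the term's record.
    Returns two parallel lists: sorted keys and their offsets.
    """
    pairs = sorted((encode_key(f, t), off) for f, t, off in entries)
    keys = [key for key, _ in pairs]
    offsets = [off for _, off in pairs]
    return keys, offsets


def find_offset(keys, offsets, field_num, term_text):
    """Binary search the in-memory keys; return the offset or None."""
    key = encode_key(field_num, term_text)
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return offsets[i]
    return None
```

Because the field number is big-endian, a plain byte comparison orders keys by field number first and term text second, so no custom comparator is needed.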

It might also make sense to just load up the raw .tis file into RAM.   
That would require even less memory, and would eliminate the disk  
seeks, but would still have to be traversed linearly and decompressed.
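
To show what "traversed linearly and decompressed" costs, here is a rough Python sketch using a simplified stand-in for the real format. It assumes each entry is one byte of shared-prefix length, one byte of suffix length, then the suffix bytes; the actual .tis layout differs (VInts and additional per-term data), and compress_terms and scan_for are made-up names.

```python
def compress_terms(sorted_terms):
    """Prefix-compress terms that are already sorted by their UTF-8 bytes."""
    buf = bytearray()
    prev = b""
    for term in sorted_terms:
        cur = term.encode("utf-8")
        # Count how many leading bytes this term shares with the previous one.
        shared = 0
        limit = min(len(prev), len(cur), 255)
        while shared < limit and prev[shared] == cur[shared]:
            shared += 1
        suffix = cur[shared:]
        if len(suffix) > 255:
            raise ValueError("suffix too long for this toy one-byte length")
        buf.append(shared)
        buf.append(len(suffix))
        buf += suffix
        prev = cur
    return bytes(buf)


def scan_for(buf, target):
    """Walk the buffer from the start; return the term's ordinal or -1."""
    want = target.encode("utf-8")
    prev = b""
    pos = 0
    ordinal = 0
    while pos < len(buf):
        shared, suffix_len = buf[pos], buf[pos + 1]
        pos += 2
        # Rebuild the full term from the previous term's prefix plus the suffix.
        cur = prev[:shared] + buf[pos:pos + suffix_len]
        pos += suffix_len
        if cur == want:
            return ordinal
        if cur > want:
            return -1  # passed the spot where the term would have been
        prev = cur
        ordinal += 1
    return -1
```

Each term can only be rebuilt from the one before it, which is why there is no jumping into the middle of the buffer without some index of restart points.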

Marvin Humphrey
Rectangular Research
