lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: Flexible indexing design
Date Tue, 15 Apr 2008 19:03:55 GMT

On Apr 13, 2008, at 2:35 AM, Michael McCandless wrote:

> I think the major difference is locality?  In a compound file, you
> have to seek "far away" to reach the prx & skip data (if they are
> separate).

There's another item worth mentioning, something that Doug, Grant and  
I discussed when this flexible indexing talk started way back when.   
When you unify frq/prx data into a single file, phrase queries and the  
like benefit from improved locality, but simple term queries are  
impeded because needless positional data must be plowed through.

We dismissed that cost with the assertion that you could specify a  
match-only field for simple queries if that was important to you, but  
IME that doesn't seem to be very practical.  It's hard for the  
internals to know that they should prefer one field over another based  
on the type of query, and hard to manually override everywhere.
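
A minimal sketch of the tradeoff described above, assuming a toy
encoding where each stream is just a list of integers (doc deltas,
freqs, and position deltas). The function names and layout are
hypothetical and are not the Lucene or KS file formats; the point is
only that a term query over a unified stream has to step over
positional data it never uses.

```python
from typing import List, Tuple

# (doc_id, positions) pairs for one term; illustrative data only.
SAMPLE_POSTINGS = [(3, [1, 5, 9]), (7, [2]), (12, [4, 8])]


def _deltas(values: List[int]) -> List[int]:
    """Delta-encode an ascending list of integers."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def encode_unified(postings: List[Tuple[int, List[int]]]) -> List[int]:
    """Interleave doc delta, freq, and position deltas in one stream."""
    out, last_doc = [], 0
    for doc, positions in postings:
        out.append(doc - last_doc)
        last_doc = doc
        out.append(len(positions))
        out.extend(_deltas(positions))
    return out


def encode_split(postings: List[Tuple[int, List[int]]]):
    """Write doc delta + freq to one stream, position deltas to another."""
    frq, prx, last_doc = [], [], 0
    for doc, positions in postings:
        frq.append(doc - last_doc)
        last_doc = doc
        frq.append(len(positions))
        prx.extend(_deltas(positions))
    return frq, prx


def term_query_unified(stream: List[int]):
    """Collect doc ids; every position entry still has to be passed over."""
    docs, doc, i, ints_touched = [], 0, 0, 0
    while i < len(stream):
        doc += stream[i]
        freq = stream[i + 1]
        # With variable-length ints, each position would be decoded to skip it.
        ints_touched += 2 + freq
        i += 2 + freq
        docs.append(doc)
    return docs, ints_touched


def term_query_split(frq: List[int]):
    """Collect doc ids from the frq stream alone; prx is never opened."""
    docs, doc = [], 0
    for i in range(0, len(frq), 2):
        doc += frq[i]
        docs.append(doc)
    return docs, len(frq)
```

Both queries return the same doc ids for the sample data, but the
unified version touches every integer in the stream while the split
version touches only the frq entries.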

> This is like "column stride" vs "row stride" serialization
> of a matrix.
> Relatively soon, though, we will all be on SSDs, so maybe this
> locality argument becomes far less important ;)

Yes, I've thought about that.  It defeats the phrase-query locality  
argument for unified postings files and recommends breaking things up  
logically by type of data into frq/prx/payload/whatever.

Would it be possible to design a Posting plugin class that reads from  
multiple files?  I'm pretty sure the answer is yes.  It messes up the  
single-stream readRecord() method I've been detailing and might force  
Posting to maintain state.  But if Postings are scarce
TermBuffer-style objects where values mutate, rather than populous
Term-style objects where you need a new instance for each set of
values, then it doesn't matter if they're large.
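
As a rough illustration of what such a multi-file Posting might look
like, here is a sketch under stated assumptions: IntStream,
MultiFilePosting, and the want_positions flag are all hypothetical
names, and the streams are in-memory integer lists rather than real
index files. It shows the extra state the Posting has to carry (how
far the position stream has fallen behind) and that a single instance
is mutated for every record.

```python
class IntStream:
    """Stand-in for an input stream over one index file (hypothetical)."""

    def __init__(self, values):
        self._values = list(values)
        self._pos = 0

    def read_int(self):
        value = self._values[self._pos]
        self._pos += 1
        return value


class MultiFilePosting:
    """One reusable, mutable posting fed by separate frq and prx streams."""

    def __init__(self, frq, prx):
        self.frq = frq
        self.prx = prx
        self.doc = 0
        self.freq = 0
        self.positions = []
        # State forced on us by having two streams: the number of prx
        # entries belonging to records whose positions were not read.
        self._pending = 0

    def read_record(self, want_positions=True):
        """Advance to the next document, overwriting the previous values."""
        self.doc += self.frq.read_int()
        self.freq = self.frq.read_int()
        self.positions = []
        if not want_positions:
            self._pending += self.freq
            return
        # Catch the prx stream up past any records we skipped earlier.
        for _ in range(self._pending):
            self.prx.read_int()
        self._pending = 0
        position = 0
        for _ in range(self.freq):
            position += self.prx.read_int()
            self.positions.append(position)


# Illustrative data: docs 3, 7, 12 with positions [1, 5, 9], [2], [4, 8].
frq_stream = IntStream([3, 3, 4, 1, 5, 2])
prx_stream = IntStream([1, 4, 4, 2, 4, 4])
posting = MultiFilePosting(frq_stream, prx_stream)
```

A simple term query would call read_record(want_positions=False) and
never touch the prx stream, while a phrase query would ask for
positions and pay for the catch-up.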

If that could be done, I think it would be possible to retrofit the  
Posting/PostingList concept into Lucene without a file format change.   

> Does KS allow non-compound format?

No, it doesn't.

> I would think running out of file descriptors is a common problem
> otherwise.

The default per-process limit for file descriptors on OS X is 256.   
Under the Lucene non-compound file format, you're guaranteed to run  
out of file descriptors eventually under normal usage.  If KS allowed  
a non-compound format, you'd also be guaranteed to run out of file  
descriptors, just sooner.  Since not failing at all is the only  
acceptable outcome, there's not much practical difference.
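
The arithmetic behind that can be sketched as follows. Only the 256
limit comes from the message above; the files-per-segment count and
the number of reserved descriptors are illustrative assumptions, and
max_open_segments and current_fd_limit are hypothetical helper names.

```python
def max_open_segments(fd_limit, files_per_segment, reserved=3):
    """Segments that can be held open before descriptors run out.

    `reserved` covers stdin, stdout, and stderr; real processes
    usually hold more than that.
    """
    return (fd_limit - reserved) // files_per_segment


def current_fd_limit():
    """Soft per-process descriptor limit on Unix-like systems."""
    import resource  # not available on Windows

    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


# Assumed for illustration: 8 files per non-compound segment,
# 1 file per compound segment.
non_compound = max_open_segments(256, files_per_segment=8)
compound = max_open_segments(256, files_per_segment=1)
```

Whatever the real per-segment count is, the compound layout only
moves the ceiling; it does not remove it.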

I think there's more to be gained from tweaking out the VFS than in  
accommodating a non-compound format.  Saddling users with file  
descriptor constraint worries and having to invoke ulimit all the time  

>> My conclusion was that it was better to exploit the benefits of
>> bounded, single-purpose streams and simple file formats whenever
>> possible.
>> There's also a middle way, where each *format* gets its own file.
>> Then you wind up with fewer files, but you have to track field
>> number state.
>> The nice thing is that packet-scoped plugins can be compatible with
>> ALL of these configurations:
> Right.  This way users can pick & choose how to put things in the
> index (with "healthy" defaults, of course).

Well, IMO, we don't want the users to have to care about the container  

Under the TermDocs/TermPositions model, every time you add new data,  
you need to subclass the containers.  Under the PostingList model, you  
don't -- Posting plugs in.

For KS at least, the primary goal is to make Posting public and as  
easy to subclass as possible -- because a public Posting plugin class  
seems to me to be the easiest way to add custom flexible indexing  
features like text payloads, or arbitrary integer values used by  
custom function queries, or other schemes not yet considered.
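
To make the contrast concrete, here is a minimal sketch of the
PostingList idea, assuming a toy record stream. The class and method
names (Posting, PayloadPosting, PostingList, read_record) are
hypothetical stand-ins rather than the actual KS or Lucene API. The
container never changes; new per-posting data arrives by subclassing
Posting alone.

```python
class Posting:
    """Base posting: decodes a doc delta from the stream."""

    def __init__(self):
        self.doc = 0

    def read_record(self, stream):
        self.doc += next(stream)


class PayloadPosting(Posting):
    """Custom posting that also carries a text payload per document."""

    def __init__(self):
        super().__init__()
        self.payload = None

    def read_record(self, stream):
        super().read_record(stream)
        self.payload = next(stream)


class PostingList:
    """Generic container: knows how many records exist, not their layout."""

    def __init__(self, posting, stream, doc_freq):
        self.posting = posting
        self._stream = iter(stream)
        self._remaining = doc_freq

    def next(self):
        """Advance the shared Posting; return False when exhausted."""
        if self._remaining == 0:
            return False
        self.posting.read_record(self._stream)
        self._remaining -= 1
        return True


def collect(posting_list, fields):
    """Drain a PostingList, copying the named fields out of each record."""
    rows = []
    while posting_list.next():
        rows.append(tuple(getattr(posting_list.posting, f) for f in fields))
    return rows
```

The same PostingList drives both Posting and PayloadPosting, which is
the sense in which Posting "plugs in" instead of requiring a new
container subclass for each kind of data.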

Marvin Humphrey
Rectangular Research
