From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Flexible indexing design
Date Sat, 12 Apr 2008 21:26:06 GMT

On Apr 10, 2008, at 3:10 AM, Michael McCandless wrote:

> Can't you compartmentalize while still serializing skip data into the
> single frq/prx file?

Yes, that's possible.

The way KS is set up right now, PostingList objects maintain i/o  
state, and Posting's Read_Record() method just deals with whatever  
instream gets passed to it.  If the PostingList were to sneak in the  
reading of a skip packet, the Posting would be none the wiser.
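
Here's a rough sketch of what I mean, in Java for this list's benefit.
All names are hypothetical -- readRecord() stands in for KS's
Read_Record(), and the length-prefixed skip framing is made up:

    import java.io.DataInput;
    import java.io.IOException;

    // The Posting only ever sees a stream; it decodes one record and
    // has no idea what else the stream contains.
    interface Posting {
        void readRecord(DataInput in) throws IOException;
    }

    class PostingList {
        private final DataInput in;
        private final Posting posting;
        private final long skipInterval;
        private long recordsUntilSkip;

        PostingList(DataInput in, Posting posting, long skipInterval) {
            this.in = in;
            this.posting = posting;
            this.skipInterval = skipInterval;
            this.recordsUntilSkip = skipInterval;
        }

        void next() throws IOException {
            if (recordsUntilSkip == 0) {
                readSkipPacket();            // sneak in the skip data...
                recordsUntilSkip = skipInterval;
            }
            posting.readRecord(in);          // ...and the Posting is none the wiser
            recordsUntilSkip--;
        }

        private void readSkipPacket() throws IOException {
            int length = in.readInt();       // made-up length-prefixed framing
            in.skipBytes(length);            // a real reader would decode, not skip
        }
    }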

> This is analogous to how videos are encoded.  E.g. the AVI file format
> is a "container" format, and it contains packets of video and packets
> of audio, interleaved at the right rate so a player can play both in
> sync.  The "container" has no idea how to decode the audio and video
> packets.  Separate codecs do that.
>
> Taking this back to Lucene, there's a container format that, using
> TermInfo, knows where the frq/prx data (packet) is and where the skip
> data (packet) is.  And it calls on separate decoders to decode each.

This is an intriguing proposal.  :)
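
To make the analogy concrete, the split might look something like this
(all names hypothetical; the container dispatches on packet boundaries
it gets from TermInfo-style metadata and never decodes anything itself):

    import java.io.DataInput;
    import java.io.IOException;

    // Each codec understands exactly one packet format; the container
    // understands none of them.
    interface PacketDecoder {
        void decode(DataInput in, long packetLength) throws IOException;
    }

    class ContainerReader {
        private final PacketDecoder frqPrxDecoder;
        private final PacketDecoder skipDecoder;

        ContainerReader(PacketDecoder frqPrxDecoder, PacketDecoder skipDecoder) {
            this.frqPrxDecoder = frqPrxDecoder;
            this.skipDecoder = skipDecoder;
        }

        // Metadata tells the container where each packet lives; it hands
        // each one to the right codec without decoding anything itself.
        void readTerm(DataInput in, long frqPrxLength, long skipLength)
                throws IOException {
            frqPrxDecoder.decode(in, frqPrxLength);
            skipDecoder.decode(in, skipLength);
        }
    }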

The dev branch of KS currently uses oodles of per-segment files for  
the lexicon and the postings:

   * One postings file per field per segment.      [SEGNAME-FIELDNUM.p]
   * One lexicon file per field per segment.       [SEGNAME-FIELDNUM.lex]
   * One lexicon index file per field per segment. [SEGNAME-FIELDNUM.lexx]

Having so many files is something of a drawback, but it means that  
each individual file can be very specialized, and that yields numerous  
benefits:

   * Each file has a simple format.
   * File format specs are easier to write and understand.
   * Formats are pluggable.
       o Easy to deprecate.
       o Easy to isolate within a single class.
   * PostingList objects are always single-field.
       o Simplified internals.
           * No field numbers to track.
            * Repeat one read operation to scan the whole file (see the
              sketch after this list).
       o Pluggable using subclasses of Posting.
       o Fewer subclasses (e.g. SegmentTermPositions is not needed).
   * Lexicon objects are always single-field.
       o Simplified internals.
           * No field numbers to track.
           * Repeat one read operation to scan the whole file.
       o Possible to extend with custom per-field sorting at index-time.
       o Easier to extend to non-text terms.
            * Comparison ops are guaranteed to see like objects.
   * Stream-related errors are comparatively easy to track down.
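
To illustrate the "one read operation" point: scanning a single-field
postings file needs nothing but a loop.  A hypothetical sketch, reusing
the Posting interface from above:

    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.IOException;

    class SingleFieldScan {
        // No field numbers to track: the same read op, repeated to EOF.
        static void scan(String filename, Posting posting) throws IOException {
            try (DataInputStream in =
                     new DataInputStream(new FileInputStream(filename))) {
                while (true) {
                    try {
                        posting.readRecord(in);
                    } catch (EOFException done) {
                        break;               // clean EOF is the normal exit
                    }
                }
            }
        }
    }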

Some of these benefits are preserved when reading from a single  
stream.  However, there are some downsides:

   * Container classes like PostingList more complex.
       o No longer single-field.
       o Harder to detect overruns that would have been EOF errors.
       o Easier to lose stream sync.
       o Periodic sampling for index records more complex.
            * Tricky to prevent inappropriate compareTo ops at boundaries.
   * Harder to troubleshoot.
       o Glitch in one plugin can manifest as an error somewhere else.
       o Hexdump nearly impossible to interpret.
        o Mentally taxing to follow like packets in an interleaved stream.
   * File corruption harder to recover from.
       o Only as reliable as the weakest plugin.

Benefits of the single stream include:

   * Fewer hard disk seeks.
   * Far fewer files.

If you're using Lucene's non-compound file format, having far fewer  
files could be a significant benefit depending on the OS.  But here's  
the thing:

If you're using a custom virtual file system a la Lucene's compound  
files, what's the difference between divvying up data using filenames  
within the CompoundFileReader object, and divvying up data downstream  
in some other object using some ad hoc mechanism?
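
With Lucene's Directory abstraction the point is literal: the compound
file reader is itself a Directory, so consumers carve up the data by
name either way.  A sketch using Directory.openInput() from the store
package (the per-field filename follows the KS scheme above and is
made up):

    import java.io.IOException;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IndexInput;

    class VirtualFiles {
        // The caller can't tell whether this is a real file or an entry
        // inside a compound file: the compound file reader is a Directory too.
        static IndexInput openPostings(Directory dir, String seg, int fieldNum)
                throws IOException {
            return dir.openInput(seg + "-" + fieldNum + ".p");  // KS-style name
        }
    }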

My conclusion was that it was better to exploit the benefits of  
bounded, single-purpose streams and simple file formats whenever  
possible.

There's also a middle way, where each *format* gets its own file.   
Then you wind up with fewer files, but you have to track field number  
state.
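
A sketch of that bookkeeping, with a made-up format where each block
carries a field-number prefix (again reusing the hypothetical Posting
interface):

    import java.io.DataInput;
    import java.io.IOException;

    class PerFormatPostingsReader {
        private int currentField = -1;   // state the per-field layout never needs

        void readBlock(DataInput in, Posting[] postingsByField)
                throws IOException {
            currentField = in.readInt();         // made-up field-number prefix
            Posting posting = postingsByField[currentField];
            int recordCount = in.readInt();      // made-up block header
            for (int i = 0; i < recordCount; i++) {
                posting.readRecord(in);
            }
        }
    }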

The nice thing is that packet-scoped plugins can be compatible with  
ALL of these configurations:

> This way we can decouple the question of "how many files do I store my
> things in" from "how is each thing encoded/decoded".  Maybe I want
> frq/prx/skip all in one file, or maybe I want them in 3 different files.

Well said.

>> The second problem is how to share a term dictionary over a cluster.
>> It would be nice to be able to plug modules into IndexReader that
>> represent clusters of machines but that are dedicated to specific
>> tasks: one cluster could be dedicated to fetching full documents and
>> applying highlighting; another cluster could be dedicated to scanning
>> through postings and finding/scoring hits; a third cluster could
>> store the entire term dictionary in RAM.
>>
>> A centralized term dictionary held in RAM would be particularly handy
>> for sorting purposes.  The problem is that the file pointers of a
>> term dictionary are specific to indexes on individual machines.  A
>> shared dictionary in RAM would have to contain pointers for *all*
>> clients, which isn't really workable.
>>
>> So, just how do you go about assembling task-specific clusters?  The
>> stored documents cluster is easy, but the term dictionary and the
>> postings are hard.
>
> Phew!  This is way beyond what I'm trying to solve now :)

Hmm.  It doesn't look that difficult from my perspective.  The problem  
seems reasonably well isolated and contained.  But I've worked hard to  
make KS modular, so perhaps there's less distance left to travel.
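
For what it's worth, the decomposition described above looks roughly
like this.  All of these interfaces are hypothetical, and the hard part
-- a machine-neutral handle to replace per-machine file pointers -- is
only gestured at in a comment:

    // One cluster fetches and highlights stored documents...
    interface StoredDocService {
        String fetchAndHighlight(int docNum, String query);
    }

    // ...another scans postings and scores hits...
    interface PostingsService {
        int[] findAndScoreHits(String field, String term);
    }

    // ...and a third holds the entire term dictionary in RAM.
    interface TermDictionaryService {
        // A raw file pointer is per-machine, so it can't be the return
        // value here; some machine-neutral handle (a term ordinal, say)
        // would be needed instead.
        long lookup(String field, String term);
    }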

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



