lucene-dev mailing list archives

From "Michael McCandless" <>
Subject Re: Flexible indexing design
Date Sun, 13 Apr 2008 09:35:19 GMT
Marvin Humphrey wrote:
>  On Apr 10, 2008, at 3:10 AM, Michael McCandless wrote:
> > Can't you compartmentalize while still serializing skip data into the
> > single frq/prx file?
> >
> Yes, that's possible.
> The way KS is set up right now, PostingList objects maintain i/o state, and
> Posting's Read_Record() method just deals with whatever instream gets passed
> to it.  If the PostingList were to sneak in the reading of a skip packet,
> the Posting would be none the wiser.

Got it.

> > This is analogous to how videos are encoded.  E.g., the AVI file format
> > is a "container" format, and it contains packets of video and packets
> > of audio, interleaved at the right rate so a player can play both in
> > sync.  The "container" has no idea how to decode the audio and video
> > packets.  Separate codecs do that.
> >
> > Taking this back to Lucene, there's a container format that, using
> > TermInfo, knows where the frq/prx data (packet) is and where the skip
> > data (packet) is.  And it calls on separate decoders to decode each.
> >
> This is an intriguing proposal.  :)
> The dev branch of KS currently uses oodles of per-segment files for the
> lexicon and the postings:
>   * One postings file per field per segment.      [SEGNAME-FIELDNUM.p]
>   * One lexicon file per field per segment.       [SEGNAME-FIELDNUM.lex]
>   * One lexicon index file per field per segment. [SEGNAME-FIELDNUM.lexx]
> Having so many files is something of a drawback, but it means that each
> individual file can be very specialized, and that yields numerous benefits:
>   * Each file has a simple format.
>   * File Format spec easier to write and understand.
>   * Formats are pluggable.
>       o Easy to deprecate.
>       o Easy to isolate within a single class.
>   * PostingList objects are always single-field.
>       o Simplified internals.
>           * No field numbers to track.
>           * Repeat one read operation to scan the whole file.
>       o Pluggable using subclasses of Posting.
>       o Fewer subclasses (e.g. SegmentTermPositions is not needed).
>   * Lexicon objects are always single-field.
>       o Simplified internals.
>           * No field numbers to track.
>           * Repeat one read operation to scan the whole file.
>       o Possible to extend with custom per-field sorting at index-time.
>       o Easier to extend to non-text terms.
>           * Comparison ops guaranteed to see like objects.
>   * Stream-related errors are comparatively easy to track down.
> Some of these benefits are preserved when reading from a single stream.
> However, there are some downsides:
>   * Container classes like PostingList more complex.
>       o No longer single-field.
>       o Harder to detect overruns that would have been EOF errors.
>       o Easier to lose stream sync.
>       o Periodic sampling for index records more complex.
>           * Tricky to prevent inappropriate compareTo ops at boundaries.
>   * Harder to troubleshoot.
>       o Glitch in one plugin can manifest as an error somewhere else.
>       o Hexdump nearly impossible to interpret.
>       o Mentally taxing to follow like packets in an interleaved stream.
>   * File corruption harder to recover from.
>       o Only as reliable as the weakest plugin.
> Benefits of the single stream include:
>   * Fewer hard disk seeks.
>   * Far fewer files.
> If you're using Lucene's non-compound file format, having far fewer files
> could be a significant benefit depending on the OS.  But here's the thing:
> If you're using a custom virtual file system a la Lucene's compound files,
> what's the difference between divvying up data using filenames within the
> CompoundFileReader object, and divvying up data downstream in some other
> object using some ad hoc mechanism?

I think the major difference is locality?  In a compound file, you
have to seek "far away" to reach the prx & skip data (if they are
separate).  This is like "column stride" vs "row stride" serialization
of a matrix.
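
To make the matrix analogy concrete, here is a rough sketch (Python, purely
illustrative).  It assumes every term has one fixed-size record in each of
three streams (frq, prx, skip), which real postings do not, and the function
names and sizes are made up for the example.  It just compares how far apart
one term's records end up under each layout:

```python
def row_stride_offset(term, stream, n_streams):
    # Interleaved layout: all streams for one term sit next to each other,
    # i.e. [frq0, prx0, skip0, frq1, prx1, skip1, ...]
    return term * n_streams + stream


def column_stride_offset(term, stream, n_terms):
    # Separated layout: each stream is stored contiguously,
    # i.e. [frq0, frq1, ..., prx0, prx1, ..., skip0, skip1, ...]
    return stream * n_terms + term


# Hypothetical sizes; stream ids: 0 = frq, 1 = prx, 2 = skip.
n_terms, n_streams = 1000, 3
term = 42

row = [row_stride_offset(term, s, n_streams) for s in range(n_streams)]
col = [column_stride_offset(term, s, n_terms) for s in range(n_streams)]

# Distance (in records) between the first and last record for this term.
row_span = max(row) - min(row)
col_span = max(col) - min(col)
print(row_span, col_span)
```

In the interleaved layout the three records for a term are adjacent; in the
separated layout they are roughly a whole stream apart, which is the "far
away" seek.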

Relatively soon, though, we will all be on SSDs, so maybe this
locality argument becomes far less important ;)

Does KS allow non-compound format?  I would think running out of
file descriptors is a common problem otherwise.  Though, I think your
Fibonacci merge policy is more "aggressive" than Lucene's
LogMergePolicy (i.e., fewer segments for the same # docs).

> My conclusion was that it was better to exploit the benefits of bounded,
> single-purpose streams and simple file formats whenever possible.
> There's also a middle way, where each *format* gets its own file.  Then you
> wind up with fewer files, but you have to track field number state.
> The nice thing is that packet-scoped plugins can be compatible with ALL of
> these configurations:

Right.  This way users can pick & choose how to put things in the
index (with "healthy" defaults, of course).
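
A minimal sketch of the "container plus packet-scoped decoders" idea, in
Python for brevity.  Nothing here is Lucene's or KS's actual format: the
type tags, the 5-byte header (1-byte kind, 4-byte length), and the two toy
payload encodings are all assumptions made up for illustration.  The point
is only that the container loop knows packet boundaries and kinds, and
hands each payload to a separate decoder:

```python
import io
import struct

FRQ, SKIP = 0, 1  # hypothetical packet type tags


def write_packet(out, kind, payload):
    # Container framing (assumed): 1-byte kind, 4-byte big-endian length.
    out.write(struct.pack(">BI", kind, len(payload)))
    out.write(payload)


def read_packets(stream, decoders):
    # The container only understands framing; decoding is delegated.
    results = []
    while True:
        header = stream.read(5)
        if not header:
            break  # clean end of stream
        if len(header) < 5:
            raise EOFError("truncated packet header")
        kind, length = struct.unpack(">BI", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated packet payload")
        results.append((kind, decoders[kind](payload)))
    return results


def decode_frq(payload):
    # Toy encoding: (doc delta, freq) pairs, one byte each.
    return [(payload[i], payload[i + 1]) for i in range(0, len(payload), 2)]


def decode_skip(payload):
    # Toy encoding: a run of 4-byte big-endian unsigned ints.
    return list(struct.unpack(">%dI" % (len(payload) // 4), payload))


buf = io.BytesIO()
write_packet(buf, FRQ, bytes([1, 3, 2, 1]))
write_packet(buf, SKIP, struct.pack(">2I", 0, 9))
write_packet(buf, FRQ, bytes([5, 2]))
buf.seek(0)

packets = read_packets(buf, {FRQ: decode_frq, SKIP: decode_skip})
print(packets)
```

Swapping in a different decoder for one packet kind does not touch the
container loop, and the same decoders would work unchanged if each kind
lived in its own file instead.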

