incubator-lucy-dev mailing list archives

From Michael McCandless
Subject Re: Snapshot
Date Tue, 24 Mar 2009 11:57:48 GMT
Marvin Humphrey wrote:
> Greets,
> As Lucy indexes are modified, they will move forward in discrete steps, each
> of which will present a coherent point-in-time view of the index data.
> Generically speaking, such point-in-time views of data are often called
> "snapshots".
> Since index files, once written, are never modified, a list of all the files
> included in a "snapshot" is sufficient to describe it completely.

But what about other per-segment data that you might want to store?
E.g., Lucene now stores a per-segment deletion count in the segments file.

Does "format" declare only the snapshot file's format?  Or is it a
system-wide format that also covers versioning of all the binary
segment files?

> I propose that the master file which defines the snapshot be named the
> "snapshot" file, that its primary purpose be to provide a list of files, that
> it be encoded as human-readable JSON, and that we publish a public class named
> Lucy::Index::Snapshot to control it.

Will it use the same write-once, lockless approach (snapshot_N) that Lucene uses?
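
To make the question concrete, here is a minimal Python sketch of a
write-once generation scheme of the kind Lucene uses for segments_N.
Everything specific in it is an assumption for illustration, not Lucy's
design: the "snapshot_N.json" file name, the base-36 generation suffix
(borrowed from Lucene's convention), the "entries" key, and the helper
names.

```python
import json
import os
import re

# Hypothetical naming convention: snapshot_<base-36 generation>.json
SNAPSHOT_RE = re.compile(r"^snapshot_([0-9a-z]+)\.json$")
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n):
    """Encode a non-negative integer in base 36."""
    if n == 0:
        return "0"
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
    return out


def latest_generation(filenames):
    """Return the highest snapshot generation found, or 0 if there is none."""
    gens = [int(m.group(1), 36) for m in map(SNAPSHOT_RE.match, filenames) if m]
    return max(gens, default=0)


def write_snapshot(folder, entries):
    """Write a new snapshot file under the next generation number.

    The file is opened in exclusive-create mode ("x"), so an existing
    snapshot is never overwritten; a writer that loses a race gets
    FileExistsError instead of clobbering the other writer's file.
    """
    gen = latest_generation(os.listdir(folder)) + 1
    name = "snapshot_" + to_base36(gen) + ".json"
    with open(os.path.join(folder, name), "x") as fh:
        json.dump({"entries": sorted(entries)}, fh)
    return name
```

A reader under this scheme would open whichever snapshot has the highest
generation and never needs a lock, because a snapshot file does not change
once it has been written.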

> In Lucene, the file which defines the snapshot is the binary "segments" file,
> which contains a list of the active segments, along with some metadata
> describing the characteristics of each segment.  The files which make up the
> snapshot are implied by the segment names, though the association isn't
> perfect: for instance, the "doc store" files contain data which may be
> referenced by more than one segment.  This approach has several drawbacks.
> Listing files is superior to listing segments, first because no kludges are
> required to deal with extra-segment files like Lucene's doc stores, but also
> because it allows pluggable index components greater flexibility.  The only
> way that the "segments" model can be extended to handle arbitrary files is to
> add special case code to core classes.  In contrast, the list-of-files model
> allows individual components to manage their own data files, calling
> Snapshot_Add_Entry() when new files are added during indexing, and
> Snapshot_Delete_Entry() during merging when the plugin can determine that a
> file is truly no longer needed.

It still seems like storing per-segment metadata in the snapshot would
be necessary/helpful.
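
As a rough illustration of what I mean, a JSON snapshot could carry an
optional per-segment section next to the list of files. The Python sketch
below is only that: the "entries" and "segments" keys, the "del_count"
field, and the file names are all hypothetical and not part of the
proposal.

```python
import json


def build_snapshot(files, segment_meta=None):
    """Serialize a snapshot: the file list plus optional per-segment metadata."""
    doc = {"entries": sorted(files)}
    if segment_meta:
        # Hypothetical extension: metadata keyed by segment name.
        doc["segments"] = segment_meta
    return json.dumps(doc, indent=2, sort_keys=True)


snapshot_json = build_snapshot(
    ["seg_1/postings.dat", "seg_1/deletions.dat", "seg_2/postings.dat"],
    {"seg_1": {"del_count": 3}, "seg_2": {"del_count": 0}},
)
parsed = json.loads(snapshot_json)
```

The list of files still describes the snapshot completely; the extra
section only saves a reader from opening each segment to learn things like
its deletion count.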

> Snapshot_Delete_Entry() does not delete the file from the index folder; all it
> does is remove the filename from the next snapshot to be written.  Once the
> new snapshot has been committed, it is possible to identify candidates for
> deletion by determining which files are present in the old snapshot file but
> gone from the new one.

Are you just doing reference counting to determine deletable files?

Will Lucy allow more than one snapshot to remain in the index?
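
The two questions are related. With a single snapshot, the old-versus-new
comparison described above is enough; once several snapshots may remain in
the index, a file is only safe to delete when no retained snapshot lists
it. A minimal Python sketch of both cases follows; the function names and
the in-memory lists standing in for snapshot files are assumptions made
for illustration.

```python
from collections import Counter


def deletion_candidates(old_entries, new_entries):
    """Single-snapshot case: files listed in the old snapshot but not the new."""
    return set(old_entries) - set(new_entries)


def deletable_with_refcounts(retained_snapshots, files_on_disk):
    """Multi-snapshot case: count how many retained snapshots list each file.

    A file is deletable only when its reference count is zero.
    """
    refs = Counter()
    for entries in retained_snapshots:
        refs.update(set(entries))  # count each snapshot at most once per file
    return {f for f in files_on_disk if refs[f] == 0}
```

Note that the first function only yields candidates: if an older snapshot
is still retained, a file dropped from the newest snapshot may still be
referenced, which is what the second function accounts for.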

