incubator-lucy-user mailing list archives

From Marvin Humphrey
Subject Re: [lucy-user] Index state during merges
Date Wed, 02 Nov 2011 18:29:06 GMT
Hi Goran,

I think the general answer to your question is one you will be pleased to
hear:

Any indexing session which is aborted prior to the completion of
Indexer#commit leaves the index completely unchanged.

On Wed, Nov 02, 2011 at 05:23:13PM +0200, goran kent wrote:
> during a merge of indexes to a $target, and an error occurs -- for
> whatever reason (typically because of a broken source index) --  is
> the $target buggered or still in a safe state?

What do you mean by "broken source index"?  Corrupt because bad UTF-8 snuck
in, and now it refuses to be read?

Maybe we should consider scanning incoming fields for UTF-8 sanity after all.
I don't like making everybody pay this penalty -- small though it is --
because you'll only get bad UTF-8 if your indexing setup is broken somehow.
On the other hand, I don't like that once a single bad UTF-8 sequence makes it
through a commit, the index is irretrievably corrupt -- and you only discover
that after the damage is done.

> foreach $subindex  {  $bigindex->add_index($subindex); }
> $bigindex->commit
> if one of those subindexes is broken, will it break the entire
> bigindex?

If the commit succeeds, then "bigindex" will probably be busted.  If the
commit fails, then "bigindex" will be unharmed.

> or does Lucy work on temp files and if an error occurs the
> bigindex is not harmed (ie, any changes are rolled back)?

Staged changes aren't rolled back so much as they are never committed.

These passages from Lucy::Docs::FileFormat provide useful background:

    Write-once philosophy

    All segment directory names consist of the string "seg_" followed by a
    number in base 36: seg_1, seg_5m, seg_p9s2 and so on, with higher numbers
    indicating more recent segments. Once a segment is finished and committed,
    its name is never re-used and its files are never modified.



    A "snapshot" file, e.g. snapshot_m7p.json, is a list of index files and
    directories. Because index files, once written, are never modified, the
    list of entries in a snapshot defines a point-in-time view of the data in
    an index.
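
To illustrate the base 36 naming described in the first passage, here is a
small Python sketch. It is not code from Lucy; the helper names to_base36 and
segment_number are made up for this example. It shows why segment names should
be compared by their decoded number rather than as plain strings.

```python
import string

# Digits 0-9 followed by a-z give the 36 symbols of base 36.
DIGITS = string.digits + string.ascii_lowercase


def to_base36(n: int) -> str:
    """Encode a non-negative integer in base 36 (hypothetical helper)."""
    if n < 0:
        raise ValueError("segment numbers are non-negative")
    if n == 0:
        return "0"
    out = []
    while n:
        n, rem = divmod(n, 36)
        out.append(DIGITS[rem])
    return "".join(reversed(out))


def segment_number(name: str) -> int:
    """Decode the numeric part of a name such as 'seg_5m' (hypothetical helper)."""
    prefix, _, digits = name.partition("_")
    if prefix != "seg" or not digits:
        raise ValueError(f"not a segment directory name: {name!r}")
    return int(digits, 36)


if __name__ == "__main__":
    names = ["seg_10", "seg_z", "seg_1"]
    # A plain string sort would put seg_10 before seg_z, although
    # seg_z (35) is older than seg_10 (36).
    print(sorted(names, key=segment_number))
```

Sorting by the decoded number keeps "higher numbers indicate more recent
segments" true even when the names differ in length.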

Lucy creates some temporary files during indexing, but that doesn't get to the
heart of the matter.  The main thing to understand is that Lucy creates a new
segment during indexing, but that segment is not considered part of the index
until a new snapshot file which references it gets moved into place.

The actual commit point is an atomic action: we add a new hard link to a
snapshot file which has been written out under a temporary file name.  At that
moment, the new view of the index goes live, and any new IndexSearcher will
see the new segment.
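
To make the commit point concrete, here is a rough Python sketch of the
general technique: write the snapshot under a temporary name, then add a hard
link under the final name. This is not Lucy's actual code. The ".temp" suffix,
the {"entries": [...]} layout and the function names are assumptions made up
for the illustration.

```python
import json
import os
import tempfile


def commit_snapshot(index_dir: str, snapshot_name: str, entries: list) -> None:
    """Publish a snapshot by hard-linking a fully written temp file."""
    temp_path = os.path.join(index_dir, snapshot_name + ".temp")  # assumed suffix
    final_path = os.path.join(index_dir, snapshot_name)

    # Write the complete snapshot under the temporary name first.
    with open(temp_path, "w") as fh:
        json.dump({"entries": entries}, fh)  # assumed layout, not Lucy's format
        fh.flush()
        os.fsync(fh.fileno())

    # The commit point: the link either appears or it does not.
    # os.link raises FileExistsError if final_path already exists.
    os.link(temp_path, final_path)
    os.unlink(temp_path)


def live_snapshots(index_dir: str) -> list:
    """Return the snapshot files a new reader would be able to see."""
    return sorted(
        name for name in os.listdir(index_dir)
        if name.startswith("snapshot_") and name.endswith(".json")
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as index_dir:
        commit_snapshot(index_dir, "snapshot_1.json", ["seg_1"])
        # Simulate a session that died before the hard link was added.
        with open(os.path.join(index_dir, "snapshot_2.json.temp"), "w") as fh:
            json.dump({"entries": ["seg_1", "seg_2"]}, fh)
        print(live_snapshots(index_dir))
```

In the sketch, a session that dies before os.link leaves only the temp file
behind, so readers keep seeing the previous snapshot.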

If the indexing session is aborted prior to the addition of that hard link,
then the new segment data remains orphaned and unreferenced.  It will sit
there taking up space until the next time you fire up a new Indexer (or
BackgroundMerger), at which point it will be wiped to make room for new data
about to be written.
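
The cleanup step can be pictured along these lines. Again, this is only a
Python sketch under stated assumptions, not how Lucy implements it: the
snapshot layout {"entries": [...]} and the function names are invented for the
example, and it ignores the locking a real implementation would need to
protect files still in use by other processes.

```python
import json
import os
import shutil
import tempfile


def referenced_entries(index_dir: str) -> set:
    """Collect every entry named by any committed snapshot file."""
    referenced = set()
    for name in os.listdir(index_dir):
        if name.startswith("snapshot_") and name.endswith(".json"):
            with open(os.path.join(index_dir, name)) as fh:
                referenced.update(json.load(fh)["entries"])  # assumed layout
    return referenced


def purge_orphans(index_dir: str) -> list:
    """Delete segment directories that no snapshot references."""
    keep = referenced_entries(index_dir)
    removed = []
    for name in sorted(os.listdir(index_dir)):
        path = os.path.join(index_dir, name)
        if name.startswith("seg_") and os.path.isdir(path) and name not in keep:
            shutil.rmtree(path)
            removed.append(name)
    return removed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as index_dir:
        os.mkdir(os.path.join(index_dir, "seg_1"))  # committed segment
        os.mkdir(os.path.join(index_dir, "seg_2"))  # left by an aborted session
        with open(os.path.join(index_dir, "snapshot_1.json"), "w") as fh:
            json.dump({"entries": ["seg_1"]}, fh)
        print(purge_orphans(index_dir))
```

Only the segment that no snapshot mentions is removed; the committed one is
left alone.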

Marvin Humphrey
