lucy-user mailing list archives

From goran kent <>
Subject Re: [lucy-user] Index state during merges
Date Wed, 02 Nov 2011 19:03:40 GMT
On Wed, Nov 2, 2011 at 8:29 PM, Marvin Humphrey <> wrote:
> Hi Goran,
> I think the general answer to your question is one you will be pleased to
> hear:
> Any indexing session which is aborted prior to the completion of
> Indexer#commit leaves the index completely unchanged.
> On Wed, Nov 02, 2011 at 05:23:13PM +0200, goran kent wrote:
>> during a merge of indexes to a $target, and an error occurs -- for
>> whatever reason (typically because of a broken source index) --  is
>> the $target buggered or still in a safe state?
> What do you mean by "broken source index"?  Corrupt because bad UTF-8 snuck
> in, and now it refuses to be read?
> Maybe we should consider scanning incoming fields for UTF-8 sanity after all.
> I don't like making everybody pay this penalty -- small though it is --
> because you'll only get bad UTF-8 if your indexing setup is broken somehow.
> On the other hand, I don't like that once a single bad UTF-8 sequence makes it
> through a commit, the index is irretrievably corrupt -- and you only discover
> that after the damage is done.

With precisely this in mind, my code does some gymnastics to try to
make sure bad UTF-8 doesn't make it in.  But you never know when
dealing with the vagaries of the 'tubes.

>> foreach $subindex  {  $bigindex->add_index($subindex); }
>> $bigindex->commit
>> if one of those subindexes is broken, will it break the entire
>> bigindex?
> If the commit succeeds, then "bigindex" will probably be busted.  If the
> commit fails, then "bigindex" will be unharmed.

OK.  Nothing else to do but see what happens then -- this is where a
util to rapidly scan the index (without correcting anything) and verify
its integrity would be incredibly useful.
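
A very rough sketch of the presence-checking part of such a scan, in
plain Perl with core modules only.  It does not use Lucy at all, and
missing_entries is a hypothetical helper name.  It assumes the snapshot
file is JSON with a top-level "entries" array of paths relative to the
index directory, and it treats an empty directory as suspect, which is
an assumption rather than anything Lucy defines.  It cannot detect bad
UTF-8 or any other corruption inside files that do exist.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);
use File::Spec;

# Hypothetical helper (not part of Lucy): return the snapshot entries
# that are missing on disk or that are empty directories.
sub missing_entries {
    my ($index_dir, $snapshot_name) = @_;
    my $path = File::Spec->catfile($index_dir, $snapshot_name);
    open my $fh, '<', $path or die "Can't open $path: $!";
    my $json = do { local $/; <$fh> };
    close $fh;
    my $snapshot = decode_json($json);

    my @bad;
    for my $entry (@{ $snapshot->{entries} || [] }) {
        # Assumption: entries use "/" as the path separator.
        my $full = File::Spec->catfile($index_dir, split m{/}, $entry);
        if (!-e $full) {
            push @bad, $entry;
        }
        elsif (-d $full) {
            opendir my $dh, $full or die "Can't opendir $full: $!";
            my @kids = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
            closedir $dh;
            push @bad, $entry unless @kids;
        }
    }
    return @bad;
}

# Usage: perl check_snapshot.pl /path/to/index snapshot_2.json
if (@ARGV == 2) {
    print "$_\n" for missing_entries(@ARGV);
}
```

An empty result only means every listed entry is present; it says
nothing about whether the index can actually be read.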

>> or does Lucy work on temp files and if an error occurs the
>> bigindex is not harmed (ie, any changes are rolled back)?
> Staged changes aren't rolled back so much as they are never committed.
> These passages from Lucy::Docs::FileFormat provide useful background:
>    Write-once philosophy
>    All segment directory names consist of the string "seg_" followed by a
>    number in base 36: seg_1, seg_5m, seg_p9s2 and so on, with higher numbers
>    indicating more recent segments. Once a segment is finished and committed,
>    its name is never re-used and its files are never modified.
>    ...
>    snapshot_XXX.json
>    A "snapshot" file, e.g. snapshot_m7p.json, is list of index files and
>    directories. Because index files, once written, are never modified, the
>    list of entries in a snapshot defines a point-in-time view of the data in
>    an index.
> Lucy creates some temporary files during indexing, but that doesn't get to the
> heart of the matter.  The main thing to understand is that Lucy creates a new
> segment during indexing, but that segment is not considered part of the index
> until a new snapshot file which references it gets moved into place.
> The actual commit point is an atomic action: we add a new hard link to a
> snapshot file which has been written out under a temporary file name.  At that
> moment, the new view of the index goes live, and any new IndexSearcher will
> see the new segment.
> If the indexing session is aborted prior to the addition of that hard link,
> then the new segment data remains orphaned and unreferenced.  It will sit
> there taking up space until the next time you fire up a new Indexer (or
> BackgroundMerger), at which point it will be wiped to make room for new data
> about to be written.
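
The commit step described above can be sketched in a few lines of
plain Perl.  This only illustrates the hard-link idea and is not
Lucy's actual code: publish_snapshot, its arguments, and the file
names are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not Lucy's implementation): write the snapshot
# under a temporary name, then make it visible with a single hard link.
sub publish_snapshot {
    my ($temp_path, $final_path, $content) = @_;

    open my $fh, '>', $temp_path or die "Can't write $temp_path: $!";
    print {$fh} $content;
    close $fh or die "Can't close $temp_path: $!";

    # link() fails if $final_path already exists, so an existing
    # snapshot name is never overwritten.
    link $temp_path, $final_path
        or die "Can't link $temp_path to $final_path: $!";

    # The temporary name is no longer needed once the link exists.
    unlink $temp_path or warn "Can't remove $temp_path: $!";
    return $final_path;
}
```

If the process dies anywhere before the link call, the final name
never appears, and readers keep seeing the previous snapshot.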

Thanks for that detailed response, much appreciated.

I just had a look at the logs after going ahead and making some
changes to see how things would fare:  some of my large in-place index
merge sessions (i.e., no move to a temp copy first) are croaking with

Lucy::Index::Indexer->new failed (Failed to read seg_2
S_try_open_elements at
/home/projects/lucy/lucy/perl/../core/Lucy/Index/PolyReader.c line 251

If I inspect the index it looks like the seg_2 folder is empty.

/.../snapshot_2.json contains
  "entries": [
  "format": "2",
  "subformat": "1"

and there's a schema_2.json file.  Also, the locks/write.lock is
present indicating things ended badly.

I'll investigate what caused the in situ index update failure later
(probably a bad source index :P), but what interests me now is a
recovery procedure for these kinds of failures.

Is it ok to remove the seg_2 folder and lockfile and rename
snapshot_2.json to snapshot_1.json and likewise with schema_2.json ->

