incubator-lucy-user mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-user] Managing fresh/stale indexes
Date Tue, 01 Nov 2011 17:36:00 GMT
On Tue, Nov 01, 2011 at 11:37:30AM +0200, goran kent wrote:
> I'm pondering about the best approach to handle fresh/stale indexes
> while searches are happening (ie, we don't want to disrupt an active
> search).

Once an IndexSearcher has opened successfully, it no longer needs access to
the index dir[1] -- you can actually wipe the index dir and the IndexSearcher
will keep functioning indefinitely.  This is achieved by caching open
filehandles to all files the IndexSearcher could ever need access to, so that
it will still have access even if the files are deleted[2].
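
As a rough illustration of the Unix behaviour described above (this is not Lucy code, and the function name is made up for the example), an open filehandle stays readable after the file's directory entry has been removed:

```python
import os
import tempfile


def read_after_unlink(payload: bytes) -> bytes:
    """Write a file, open it, delete its name, then read through the handle.

    Hypothetical helper for illustration only; relies on Unix unlink semantics.
    """
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as out:
        out.write(payload)

    handle = open(path, "rb")  # stands in for a filehandle cached by a searcher
    try:
        os.unlink(path)  # removes the name; the data lives while handle is open
        assert not os.path.exists(path)
        return handle.read()
    finally:
        handle.close()
```

On Windows the unlink step would be expected to fail instead, which matches the caveat in footnote [2].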
 
> Let's say you have index1 which is stale, but actively being searched on.  I
> want to automate the process of switching in a fresh index1.fresh (symlink
> or move).  Let's say index1 is being hammered by active searches: what's the
> safest way to switch in index1.fresh?  Have an intermediary?:
>
> index1   --  a symlink to either:
> 
> index1.a/
> or
> index1.b/

Swapping out a symlink for a new one is not totally safe.  It will not affect
any existing IndexSearchers which are already open, but if the swap happens
while an attempt to open a new IndexSearcher is underway, you will get an
exception.
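
For reference, a minimal sketch of how the symlink itself is usually swapped on a POSIX system: create the new link under a temporary name and rename it over the old one. The function name and the ".tmp" suffix are assumptions for this example. Note that this only makes the change of the link atomic; an open that is underway still resolves the link once per file, so it can end up reading files from both directories and hit the problem described above.

```python
import os
import tempfile


def swap_symlink(link_path: str, new_target: str) -> None:
    """Repoint link_path at new_target by renaming a temporary symlink over it."""
    tmp_link = link_path + ".tmp"  # hypothetical scratch name
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an earlier failed swap
    os.symlink(new_target, tmp_link)
    os.replace(tmp_link, link_path)  # rename(2) replaces the link atomically
```

Usage would be along the lines of swap_symlink("index1", "index1.b") once index1.b/ is fully built.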

Lucy has a retry mechanism[3] which fires in case the index is modified and a
file needed by a particular snapshot gets deleted before the opening attempt
completes successfully.  However, that retry code works by looking for a more
recent snapshot file, as measured by the generation number embedded in the
snapshot file's name.  It will give up if it can't find such a newer snapshot
file, and thus it will not attempt to read a fresh new index which suddenly
appears in the location it was trying to open.

Put another way, the retry mechanism is only there to guarantee that opening
an index is never disrupted by updates.  It's not there to support what you
are trying to do.
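
To make the limitation concrete, here is a simplified sketch of a generation-based retry loop. It is not the code in PolyReader_do_open(); the snapshot_<generation>.json naming with a base-36 generation, the function names, and the try_open callback are all assumptions made for the example.

```python
import os
import re
import tempfile

# Assumed naming scheme: snapshot_<base-36 generation>.json
SNAPSHOT_RE = re.compile(r"^snapshot_([0-9a-z]+)\.json$")


def latest_generation(index_dir):
    """Return the highest snapshot generation found in index_dir, or None."""
    gens = []
    for name in os.listdir(index_dir):
        match = SNAPSHOT_RE.match(name)
        if match:
            gens.append(int(match.group(1), 36))
    return max(gens, default=None)


def open_with_retry(index_dir, try_open):
    """Call try_open(index_dir, gen); retry only if a newer snapshot appears.

    try_open is a hypothetical callback that raises FileNotFoundError when a
    file needed by the chosen snapshot has been deleted.
    """
    last_gen = None
    while True:
        gen = latest_generation(index_dir)
        if gen is None or (last_gen is not None and gen <= last_gen):
            # Nothing newer than the snapshot that just failed: give up.
            raise RuntimeError("no newer snapshot to retry with")
        try:
            return try_open(index_dir, gen)
        except FileNotFoundError:
            last_gen = gen
```

Under this scheme, a fresh index dropped into the same location does not rescue a failed open unless its snapshot generation happens to be higher than the one that failed, because the loop only ever moves forward by generation number.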

> (and a control file which indicates the last stale one so the link
> cycles through them correctly -- and perhaps a lockfile which tells
> searchers: hey, hold for 1s while I reassign this symlink; but perhaps
> this is overcomplicating things)

To guard against a swap interrupting an open which is underway, you may need
something like that lockfile.
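
One way such a lockfile might look, as a sketch rather than a recommendation: processes opening a searcher take a shared flock() for the duration of the open, and the process reassigning the symlink takes an exclusive one. The lock path, the function names, and the opener callback are hypothetical, and the code assumes a Unix system where fcntl.flock() is available.

```python
import contextlib
import fcntl
import os
import tempfile


@contextlib.contextmanager
def index_lock(lock_path, exclusive):
    """Hold a shared or exclusive advisory lock on lock_path."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def open_searcher(link_path, lock_path, opener):
    """Run opener on the resolved index dir while holding a shared lock.

    opener is a placeholder for whatever actually constructs the searcher.
    """
    with index_lock(lock_path, exclusive=False):
        return opener(os.path.realpath(link_path))


def switch_index(link_path, new_target, lock_path):
    """Repoint the symlink while no open is in progress."""
    with index_lock(lock_path, exclusive=True):
        tmp_link = link_path + ".tmp"  # hypothetical scratch name
        if os.path.lexists(tmp_link):
            os.remove(tmp_link)
        os.symlink(new_target, tmp_link)
        os.replace(tmp_link, link_path)
```

Many opens can proceed at once under the shared lock, and the switch waits only for opens that are in progress, not for searches on already-open searchers, which do not need the lock at all.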

I presume you are planning on making live updates to indexes you are searching
with SearchServer/SearchClient/PolySearcher, though, right?  In that case, you
may need a more sophisticated mechanism in order to move the entire cluster
forward without disrupting any searches which are underway (including fetching
docs and highlighting).

Marvin Humphrey

[1] This will change if we ever support non-compound index file formats,
    because then e.g. the Open_In() calls in SegPostingList and SegLexicon's
    constructors would need to operate on real files.  However, there are no
    plans to support the non-compound format, because Lucy uses way, way more
    "files" than Lucene.

[2] On Unix.  On Windows, you can't delete a file that's got a filehandle open
    against it, so the mechanism is slightly different.  Regardless, the
    robustness of an open IndexSearcher is something you can count on across
    all operating systems.

[3] The retry logic is in PolyReader_do_open():
    http://svn.apache.org/viewvc/incubator/lucy/trunk/core/Lucy/Index/PolyReader.c?view=markup#l275
