incubator-lucy-user mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-user] Managing fresh/stale indexes
Date Thu, 03 Nov 2011 01:20:42 GMT
On Wed, Nov 02, 2011 at 09:29:28AM +0200, goran kent wrote:
> As I'm typing this though, I'm thinking, what about
> Lucy::Search::IndexSearcher->serve?
> It's in a loop hiding its internals - how do I get *it* to honour my
> $idx_lockfile?
>
> Hack LucyX/Remote/SearchServer.pm's sub serve()?  Perhaps between
> 
> read( $client_sock, $buf, $len );
> + sleep while $idx_lockfile is extant
> my $response   = $dispatch{$method}->( $self, thaw($buf) );

Well, we're up against two problems here.

First, as described earlier, we have the issue that Lucy's index opening
mechanism is not really designed to cope with full-index swapouts.

Second, SearchServer/SearchClient is not really set up to take advantage of
Lucy's near-real-time capabilities.  It was written a long time ago, and was
designed with static index data in mind.  It was anticipated that you might
restart the service and show users a new view of the index once per day, once
per hour or once per minute -- but not that SearchServer would be expected to
provide near-instantaneous availability of updates on cue.

I think that in order to get near-real-time happening with
SearchServer/SearchClient, we may need to enhance SearchServer to handle more
than one Searcher.

Instead of having SearchServer's constructor require a Searcher, we could have
it require an index path.  Then, when a SearchClient connects, SearchServer
could open a fresh IndexSearcher, add it to a hash keyed by snapshot number,
and pass back the snapshot number identifier to the SearchClient.

All subsequent search requests from the SearchClient would need to include
this snapshot number identifier.  And when the SearchClient decides that it's
finished and sends back a "done" command to the SearchServer, only the
IndexSearcher associated with that snapshot number gets zapped (or has its
refcount decremented, if we end up needing refcounting as I suspect we might).

[ Aside to Nate Kurz: This is *precisely* the kind of construct we'd hoped to
enable when we were working on Lucy's "cheap Searcher" model, no? :) ]
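
To make the bookkeeping above concrete, here is a rough sketch in plain Perl
with no Lucy dependency.  SearcherRegistry, checkout, searcher_for, done,
live_snapshots and the "opener" callback are all made-up names for
illustration; nothing like this exists in Lucy or LucyX::Remote today, and a
real version would open an IndexSearcher on the index path inside the
callback.

```perl
use strict;
use warnings;

# Hypothetical sketch: SearcherRegistry is NOT part of Lucy or LucyX::Remote.
# It only models the bookkeeping: one searcher per snapshot, refcounted.
package SearcherRegistry;

sub new {
    my ( $class, %args ) = @_;
    my $self = {
        # Callback returning ( $snapshot_id, $searcher ) for the current
        # state of the index (an assumption of this sketch).
        opener    => $args{opener},
        searchers => {},    # snapshot id => searcher
        refcounts => {},    # snapshot id => number of attached clients
    };
    return bless $self, $class;
}

# Called when a client connects.  Returns the snapshot id which the client
# must include with every later request.
sub checkout {
    my $self = shift;
    my ( $snapshot_id, $searcher ) = $self->{opener}->();
    # Keep the first searcher opened for a snapshot; later clients share it.
    $self->{searchers}{$snapshot_id} ||= $searcher;
    $self->{refcounts}{$snapshot_id}++;
    return $snapshot_id;
}

# Look up the searcher for a request tagged with a snapshot id.
sub searcher_for {
    my ( $self, $snapshot_id ) = @_;
    my $searcher = $self->{searchers}{$snapshot_id};
    die "Unknown or released snapshot '$snapshot_id'\n"
        unless defined $searcher;
    return $searcher;
}

# Called when a client sends "done".  Only the searcher for that snapshot is
# affected, and it is dropped once its last client has finished.
sub done {
    my ( $self, $snapshot_id ) = @_;
    return unless exists $self->{refcounts}{$snapshot_id};
    if ( --$self->{refcounts}{$snapshot_id} <= 0 ) {
        delete $self->{refcounts}{$snapshot_id};
        delete $self->{searchers}{$snapshot_id};
    }
    return;
}

sub live_snapshots {
    my $self = shift;
    my @ids  = sort keys %{ $self->{searchers} };
    return @ids;
}

package main;

# Stand-in for the index: a counter that is bumped when the index changes.
my $current  = 1;
my $registry = SearcherRegistry->new(
    opener => sub { return ( "snapshot_$current", { view => $current } ) },
);

my $first  = $registry->checkout;    # client 1 gets snapshot_1
my $second = $registry->checkout;    # client 2 shares snapshot_1
$current = 2;                        # index updated
my $third  = $registry->checkout;    # client 3 gets snapshot_2

$registry->done($first);             # snapshot_1 still held by client 2
$registry->done($second);            # last holder gone; searcher dropped
print join( ", ", $registry->live_snapshots ), "\n";    # prints "snapshot_2"
```

Clients which connected before the update keep their old view until they say
"done", while new clients see the new snapshot straight away.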

> My final question:  Peter says "For full index swap-outs, I just let
> existing Searchers finish and re-open
> themselves when they realize the underlying index is different" --
> Peter, are you saying LucyX::Remote::SearchServer does this
> automatically on its own without intervention needed,

SearchServer does not auto-update when the index changes, and it would be bad
if it did.

    http://incubator.apache.org/lucy/docs/perl/Lucy/Docs/DocIDs.html#Document-ids-are-ephemeral
    
Searcher objects in Lucy always represent a point-in-time view of the indexed
corpus.  You don't want the content associated with a particular document id
to change in between performing a search and fetching a document!
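
A toy illustration of the hazard, in plain Perl rather than real Lucy calls.
The two hashes are invented stand-ins for the view of the index before and
after an update, on the assumption that the update reassigned document ids:

```perl
use strict;
use warnings;

# Toy model, not Lucy code: each "snapshot" maps document ids to content.
# The second hash assumes an index update has reassigned the ids.
my %snapshot_before = ( 1 => 'apple pie recipe',    2 => 'banana bread recipe' );
my %snapshot_after  = ( 1 => 'banana bread recipe', 2 => 'cherry tart recipe' );

# Return the ids of documents whose content contains the term.
sub search {
    my ( $snapshot, $term ) = @_;
    return grep { $snapshot->{$_} =~ /\Q$term\E/ } sort keys %$snapshot;
}

# The search runs against the old view and matches document 1.
my ($doc_id) = search( \%snapshot_before, 'apple' );

# Fetching from the same view returns the document that matched.
my $consistent = $snapshot_before{$doc_id};

# Fetching from a view that changed underneath returns a different document.
my $inconsistent = $snapshot_after{$doc_id};

print "same view:    $consistent\n";
print "changed view: $inconsistent\n";
```

Searching and fetching through one Searcher corresponds to the "same view"
case, which is why a Searcher must not silently switch to a newer index.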

> urgh, this is getting dirty ugly.
>
> /throws toys out cot, sobs

LOL, I understand the frustrations, but honestly you're right up against a
very exciting problem set.

We've wanted for a while now to build a distributed layer for Lucy that
handles sharding and replication: something that uses techniques similar to
Solr, but much thinner than Solr -- no REST interface, no extra query types,
no PDF parsing, etc -- and that fully exploits Lucy's excellent near-real-time
fundamentals.

Here at Eventful, we have our own proprietary distributed layer;
unfortunately, it's optimized for some specific requirements and it wouldn't
be easy to adapt for general use.  Nevertheless, we have the experience of
having built such a layer, and my colleagues and I had a very animated
lunchtime conversation today about how best to design a generalized tool that
would work for users like yourself.  If you want to work on such a tool for
Lucy, you're going to find collaborators!

Marvin Humphrey

