lucy-user mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-user] Splicing in a bit of caching in remote searcher
Date Mon, 07 Nov 2011 18:45:09 GMT
On Mon, Nov 07, 2011 at 10:36:23AM +0200, goran kent wrote:
> #---------incision start----------
> my $response;
> my $cached_object_id = md5sum($buf); # TODO: check if $buf is the search string
> 
> if (is_cached($cached_object_id)) {
>     $response = read_cached_object($cached_object_id);
> }
> else {
>     $response   = $dispatch{$method}->( $self, thaw($buf) );
> }
> #---------incision end----------

$buf is never the search string.  It's this, from SearchClient.pm:

    my $serialized = nfreeze($args);
    my $packed_len = pack( 'N', length($serialized) );
    print $sock "$method\n$packed_len$serialized";

$method is the name of the method to invoke on the SearchServer's local
Searcher object.  $args is a Perl hashref containing the arguments to pass to
that method; it gets serialized using Storable's nfreeze() function, and the
result is the scalar $serialized.
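To make the framing concrete, here is a minimal sketch of both ends of one
message.  encode_request() mirrors the three SearchClient.pm lines above;
decode_request() is a hypothetical helper written for illustration, not the
actual SearchServer code, and it assumes the whole message is already in
memory rather than being read off a socket.  The doc_freq arguments are just
sample data.

```perl
use strict;
use warnings;
use Storable qw( nfreeze thaw );

# Client side: frame a request as
# "<method>\n<4-byte big-endian length><frozen args>".
sub encode_request {
    my ( $method, $args ) = @_;
    my $serialized = nfreeze($args);
    my $packed_len = pack( 'N', length($serialized) );
    return "$method\n$packed_len$serialized";
}

# Server side (hypothetical): split a complete message back into the
# method name and the still-frozen argument buffer.
sub decode_request {
    my ($message) = @_;
    my $newline = index( $message, "\n" );
    die "No method line found" if $newline < 0;
    my $method = substr( $message, 0, $newline );
    my $len    = unpack( 'N', substr( $message, $newline + 1, 4 ) );
    my $buf    = substr( $message, $newline + 5, $len );
    die "Truncated payload" unless length($buf) == $len;
    return ( $method, $buf );
}

my $message = encode_request( 'doc_freq', { field => 'content', term => 'foo' } );
my ( $method, $buf ) = decode_request($message);
my $args = thaw($buf);
print "$method: $args->{field} / $args->{term}\n";
```

The point to notice is that $buf is an opaque frozen blob until thaw() is
called on it, so any cache key based on the query has to be computed after
thawing.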

When the method is "top_docs", then $args will contain a Lucy::Search::Query
*object*.  The query string has already been parsed at this point, back in the
SearchClient; at no time does the raw query string ever get sent over the wire
to the SearchServer.

However, Query objects have a to_string() method you may be able to make use
of:

    if ($method eq 'top_docs') {
        my $args = thaw($buf);
        # Key the cache on the stringified Query object.
        my $key = $args->{query}->to_string;
        if (is_cached($key)) {
            $response = read_cached_object($key);
        }
        else {
            $response = $dispatch{$method}->( $self, $args );
        }
    }
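One caveat with keying on to_string() alone: two top_docs requests for the
same query but a different num_wanted would share a key.  The sketch below
is one possible way around that, and it also stores the response on a cache
miss.  Everything in it is an assumption made for illustration: StubQuery
stands in for a real Lucy::Search::Query, %cache is a plain in-memory hash,
cached_top_docs() and cache_key() are hypothetical helpers, and $args is
assumed to carry a num_wanted entry alongside query.  A sort_spec, if
present, would need to go into the key as well; that part is left out here.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );
use Encode qw( encode_utf8 );

# Hypothetical stand-in for a Query object; only to_string() is used.
{
    package StubQuery;
    sub new       { my ( $class, $string ) = @_; return bless { string => $string }, $class }
    sub to_string { return $_[0]->{string} }
}

my %cache;           # in-memory cache: key => response
my $searches = 0;    # counts how often the "real" search runs

# Build the key from every argument that affects the result,
# not just the query string.
sub cache_key {
    my ($args) = @_;
    my $raw = join "\0", $args->{query}->to_string, $args->{num_wanted};
    return md5_hex( encode_utf8($raw) );
}

# Return the cached response, running $search only on a miss.
sub cached_top_docs {
    my ( $args, $search ) = @_;
    my $key = cache_key($args);
    $cache{$key} = $search->($args) unless exists $cache{$key};
    return $cache{$key};
}

# Fake search callback standing in for $dispatch{top_docs}.
my $search = sub { $searches++; return "top docs for " . $_[0]->{query}->to_string };
my $query  = StubQuery->new('content:foo');

cached_top_docs( { query => $query, num_wanted => 10 }, $search );    # miss
cached_top_docs( { query => $query, num_wanted => 10 }, $search );    # hit
cached_top_docs( { query => $query, num_wanted => 20 }, $search );    # miss
print "searches run: $searches\n";
```

Hashing the key keeps it a fixed length regardless of how long the
stringified query is, which matters if the cache lives on disk or in
something like memcached rather than in a hash.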

> I seem to recall though that the typical search is not an atomic
> transaction:  ie, the remote search protocol is broken up into
> discrete request/response chunks:

Correct.

> my $hits = $poly_searcher->hits(
>     query      => $parsed_query,
>     sort_spec  => $sort_spec,
>     offset     => 0,  # or 10, 20, etc
>     num_wanted => 10,
> );
> 
> 
> is processed roughly as:
> 
> doc_max/response
> doc_freq/response x 31
> ...
> top_docs/response
> fetch_doc/response x 10
> ...
> done
> 
> So, my question is basically:  which parts do I cache and what's the
> best way to identify those parts?

The only individual task which it could conceivably make sense to cache would
be top_docs().

All those calls to doc_freq() are part of the weighting process.  The
behavior is not ideal, but changing it is a bit of a can of worms and
server-side caching won't help, as the calls are all fast lookups.

> I have a feeling I'm going to have
> to package a group of request/responses to cache it in its
> entirety,... or something.   --or maybe this is not feasible within
> the given framework.

I can't think of a way to bundle things up without significant refactoring of
how Lucy's searching works or rearchitecting of SearchClient.

I understand why you want to do this: it allows you to invalidate chunks of the
cache piecemeal as individual nodes move forwards, rather than invalidate the
whole cache whenever any one of the nodes changes.  Hopefully caching
top_docs() alone will help.

> I essentially need a better understanding of the client/server
> interaction process so I can formulate an approach to achieve
> remote-end caching of search queries (in Perl of course, since that's
> what's being used here).

Understanding how Query objects are compiled to Matcher objects would help, so
maybe check out Lucy::Docs::Cookbook::CustomQuery.  Those doc_freq calls
happen during the weighting stage, and are used to power IDF.

    http://incubator.apache.org/lucy/docs/perl/Lucy/Docs/Cookbook/CustomQuery.html
    http://incubator.apache.org/lucy/docs/perl/Lucy/Docs/IRTheory.html#TF-IDF-ranking-algorithm
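To illustrate why the weighting stage wants a doc_freq for each term plus
the doc_max, here is a small sketch of an IDF calculation.  The formula is
the classic Lucene-style one and is an assumption on my part; check
Lucy::Index::Similarity for what Lucy actually computes.  The counts are
made up.

```perl
use strict;
use warnings;

# Inverse document frequency: rare terms score higher than common ones.
# Assumed formula: 1 + ln( doc_max / ( 1 + doc_freq ) ).
sub idf {
    my ( $doc_freq, $doc_max ) = @_;
    return 1 + log( $doc_max / ( 1 + $doc_freq ) );
}

my $doc_max  = 1_000_000;                          # made-up collection size
my %doc_freq = ( the => 900_000, lucy => 120 );    # made-up term counts

for my $term ( sort keys %doc_freq ) {
    printf "%-5s idf = %.2f\n", $term, idf( $doc_freq{$term}, $doc_max );
}
```

Each term in the query needs its own doc_freq before any scoring can start,
which is why those requests all show up ahead of top_docs in the trace.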

Marvin Humphrey

