lucy-user mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-user] Aggregating multiple searchers
Date Tue, 15 Nov 2011 04:30:33 GMT
On Wed, Nov 09, 2011 at 11:14:31PM +0200, goran kent wrote:
> Just in case Marvin doesn't get around to ClusterSearcher, I'm
> wondering whether I can cobble something together using POE::Session
> to fire off multiple remote searcher requests
> (LucyX::Remote::SearchClient), wait for all to complete, then
> aggregate the results.
> 
> That last bit has me stumped.
 
> How can I aggregate the results from a bunch of
> LucyX::Remote::SearchClient objects?  Unfortunately there's no
> Lucy::Search::Aggregate.

The problem is that queries run against different indexes do not produce
comparable scores.

A naive implementation of an aggregator would do this:

  # Run the same query against two separate indexes.
  my $hits_a = $searcher_a->hits(query => $query);
  my $hits_b = $searcher_b->hits(query => $query);
  my @hit_docs;
  push(@hit_docs, $_) while $_ = $hits_a->next;
  push(@hit_docs, $_) while $_ = $hits_b->next;
  # Sort by descending score, as if scores from both indexes were comparable.
  my @sorted = sort { $b->get_score <=> $a->get_score } @hit_docs;

However, say that you are searching for 'iphone' in two news archives, one
from 2001 and one from 2011.  In the more recent news archive, 'iphone'
will be a reasonably common term.  In the older news archive, 'iphone' will be
very rare -- let's imagine that it only appears in a single document, as a
typo.  Rare terms make for high scores -- so the top hit in your search for
'iphone' may well be the typo[1].

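That's why you want to know the doc_freq (the number of documents containing
the term) for each term across the *entire* corpus when performing query
weighting, rather than the doc_freq within each index separately.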
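
To put rough numbers on the 'iphone' example, here is a minimal sketch that
compares per-index weighting with corpus-wide weighting.  The archive sizes and
doc_freq values are made up, and the idf() helper is a Lucene-style formula
assumed purely for illustration; it is not meant to reproduce what any
particular Similarity implementation does.

```perl
use strict;
use warnings;

# Lucene-style inverse document frequency (an assumption for illustration).
sub idf {
    my ($doc_freq, $total_docs) = @_;
    return 1 + log($total_docs / (1 + $doc_freq));
}

# Hypothetical statistics for the term 'iphone' in each news archive.
my %archive = (
    2001 => { total_docs => 100_000, doc_freq => 1 },        # the typo
    2011 => { total_docs => 100_000, doc_freq => 5_000 },
);

# Per-index weighting: each archive only knows its own doc_freq.
my %local_idf;
for my $year (sort keys %archive) {
    my $stats = $archive{$year};
    $local_idf{$year} = idf($stats->{doc_freq}, $stats->{total_docs});
}

# Corpus-wide weighting: sum doc_freq and total_docs over all archives.
my ($df_sum, $docs_sum) = (0, 0);
for my $stats (values %archive) {
    $df_sum   += $stats->{doc_freq};
    $docs_sum += $stats->{total_docs};
}
my $global_idf = idf($df_sum, $docs_sum);

printf "idf in %s archive alone: %.2f\n", $_, $local_idf{$_} for sort keys %local_idf;
printf "idf across whole corpus: %.2f\n", $global_idf;
```

With per-index statistics, the 2001 archive gives 'iphone' a much larger
weight than the 2011 archive does, which is what pushes the typo to the top of
a naively merged list.  With the summed statistics, every document is weighted
against the same doc_freq, so that particular distortion goes away.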

That's not the only problem, but it's illustrative.

Marvin Humphrey

[1] I got this excellent example from Chris Hostetter.

