lucy-user mailing list archives

From Dag Lem <...@nimrod.no>
Subject Re: [lucy-user] ClusterSearcher statistics
Date Thu, 25 Oct 2012 09:51:29 GMT
Marvin Humphrey <marvin@rectangular.com> writes:

[...]

> To know how common a term is across the entire collection we need to survey
> all shards and sum the results.  All these calls must be completed before we
> can finish weighting the query, allowing us to call `top_docs()`.

OK.

> The calls to `doc_freq()` also cannot be consolidated together easily, because
> they are invoked by nested weighting methods within an arbitrarily complex
> compound query object.

I've thought a bit about this one; couldn't the problem in principle
be solved as follows?

1. Walk the query tree, storing references to all TermQuery objects in
   a list.
2. Call a new function, doc_freqs(), with a list of field/term pairs
   from the TermQuery objects as its argument. doc_freqs() would in
   essence call the existing doc_freq() for each field/term pair, and
   return some form of list of field/term/count triplets.
3. Store the returned counts in the corresponding TermQuery objects.
4. Replace all calls to doc_freq() with lookups of the precomputed
   counts in the TermQuery objects. (Alternatively, the calls can be
   kept by renaming the original doc_freq() to something else for the
   call in step 2, and implementing a replacement doc_freq() that does
   the lookup.)

Or some workable variant of the above - you get the idea :-)
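
The steps above can be sketched roughly as follows. This is only an
illustration in Python, not Lucy code: TermQuery, CompoundQuery, Shard
and prefetch_doc_freqs() are hypothetical stand-ins, and doc_freqs() is
the proposed batched call, which does not exist in Lucy today.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

FieldTerm = Tuple[str, str]


@dataclass
class TermQuery:
    # Hypothetical stand-in for a leaf query on one field/term pair.
    field_name: str
    term: str
    doc_freq: Optional[int] = None  # filled in by step 3


@dataclass
class CompoundQuery:
    # Hypothetical stand-in for any query that nests other queries.
    children: List[object] = field(default_factory=list)


class Shard:
    """Hypothetical stand-in for one remote search server."""

    def __init__(self, counts: Dict[FieldTerm, int]):
        self.counts = counts
        self.round_trips = 0

    def doc_freqs(self, pairs: List[FieldTerm]) -> List[int]:
        # Proposed batched call: one request carries all pairs.
        self.round_trips += 1
        return [self.counts.get(pair, 0) for pair in pairs]


def collect_term_queries(query) -> List[TermQuery]:
    # Step 1: walk the query tree and gather every TermQuery.
    if isinstance(query, TermQuery):
        return [query]
    found: List[TermQuery] = []
    for child in query.children:
        found.extend(collect_term_queries(child))
    return found


def prefetch_doc_freqs(query, shards: List[Shard]) -> None:
    term_queries = collect_term_queries(query)
    pairs = [(tq.field_name, tq.term) for tq in term_queries]
    totals = [0] * len(pairs)
    # Step 2: one doc_freqs() request per shard, summed across shards.
    for shard in shards:
        for i, count in enumerate(shard.doc_freqs(pairs)):
            totals[i] += count
    # Step 3: store the counts where the weighting code can find them,
    # so that step 4 becomes a plain attribute lookup.
    for tq, total in zip(term_queries, totals):
        tq.doc_freq = total
```

After prefetch_doc_freqs() returns, the nested weighting methods would
read tq.doc_freq instead of going over the network.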

The upshot of this would be that only one network round trip per
server would be necessary to fetch the document counts for all
field/term pairs, simply by replacing doc_freq() with doc_freqs() in
the application protocol.

What do you think?

> As an alternative, how about adding this new method to ClusterSearcher?
> 
>     =head2 set_stat_source
> 
>         my $local_searcher = Lucy::Search::IndexSearcher->new(
>             index => '/path/to/index',
>         );
>         $cluster_searcher->set_stat_source($local_searcher);
> 
>     Set the Searcher which will be used to find index statistics.

[...]

> So long as the ClusterSearcher runs on the same machine as a large,
> representative shard, using a local IndexSearcher should be a decent
> workaround.  Scoring will be messed up if e.g. the local shard is completely
> missing a term which is common on other shards, but at least it will be messed
> up in the same way for all hits across all shards.

I guess this would be nice to have for applications which are
extremely performance sensitive. Another idea would be to make it
possible to skip fetching statistics altogether, for use cases where
relevancy based on term frequencies is not needed.

Note, however, that assuming the solution I proposed above is
workable, the theoretically possible speedup from using local
statistics is no more than a factor of 2 (half the number of network
round trips, assuming zero cost for everything else), at the price of
increased infrastructural complexity and decreased accuracy of hit
relevancy.
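
To make the arithmetic explicit, here is a small Python sketch of the
round trip counts per server. It assumes that a search needs exactly
one request for top_docs() and that each statistics request costs one
round trip; the function and the strategy names are made up for this
illustration.

```python
def round_trips_per_server(n_terms: int, strategy: str) -> int:
    """Count requests sent to one server for a single search."""
    top_docs = 1  # assumed: one request to fetch the hits
    if strategy == "per_term":  # one doc_freq() request per term
        return n_terms + top_docs
    if strategy == "batched":   # one doc_freqs() request for all terms
        return 1 + top_docs
    if strategy == "local":     # statistics come from a local searcher
        return top_docs
    raise ValueError(f"unknown strategy: {strategy}")


batched = round_trips_per_server(5, "batched")
local = round_trips_per_server(5, "local")
speedup_bound = batched / local
```

Under these assumptions the ratio of batched to local round trips is 2
regardless of the number of terms.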

-- 
Best regards,

Dag Lem
