lucy-user mailing list archives

From: Marvin Humphrey <mar...@rectangular.com>
Subject: Re: [lucy-user] ClusterSearcher statistics
Date: Fri, 26 Oct 2012 01:25:27 GMT
On Thu, Oct 25, 2012 at 2:51 AM, Dag Lem <dag@nimrod.no> wrote:
> I've thought a bit about this one; couldn't the problem in principle
> be solved as follows?
>
> 1. Walk the query tree, storing references to all TermQuery objects in
>    a list.
> 2. Call a new function, doc_freqs(), with a list of field/term pairs
>    from the TermQuery objects as its argument. doc_freqs() would in
>    essence call the existing doc_freq() for each field/term pair, and
>    return some form of list of field/term/count triplets.
> 3. Store the returned counts in the corresponding TermQuery objects.
> 4. Replace all calls to doc_freq with lookups of precomputed counts in
>    the TermQuery objects. (Alternatively, the calls can be kept by
>    renaming the original doc_freq to something else for the call in
>    2., and implementing a replacement doc_freq to do the lookup).
>
> Or some workable variant of the above - you get the idea :-)
>
> The upshot of this would be that only one network roundtrip per server
> would be necessary in order to get hold of all of the numbers of
> documents per field/term, simply by replacing doc_freq() with
> doc_freqs() in the application protocol.
>
> What do you think?

Your instincts are good -- we used to have something like that, inherited from
Lucene.  Here's the relevant section of Lucene 1.9's MultiSearcher.java:

    http://s.apache.org/PI6  (link to svn.apache.org)

There were two parts to the old strategy: extracting terms from arbitrary
Query objects, and capturing the retrieved stats to a cache.

First, regarding term extraction, it does not suffice to walk the query tree
looking for TermQueries -- PhraseQueries also have terms, but more crucially,
so do arbitrary user-defined Query subclasses.  In order to get at terms
within arbitrary Query objects, Query needs an `extract_terms()` method which
subclasses may have to override.
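
As a sketch only -- `PairQuery` and `extract_terms()` are hypothetical,
modeled on the inside-out style of Lucy's sample Query subclasses -- a
custom query might expose its terms like so:

    package PairQuery;
    use base qw( Lucy::Search::Query );

    # Inside-out member vars, following Lucy's sample Query subclasses.
    my %field;
    my %terms;

    sub new {
        my ( $class, %args ) = @_;
        my $field = delete $args{field};
        my $terms = delete $args{terms};    # array ref of terms
        my $self  = $class->SUPER::new(%args);
        $field{$$self} = $field;
        $terms{$$self} = $terms;
        return $self;
    }

    # The hypothetical hook: hand back field/term pairs so that a
    # searcher can bulk-fetch doc_freq stats before weighting.
    sub extract_terms {
        my $self = shift;
        return map { +{ field => $field{$$self}, term => $_ } }
            @{ $terms{$$self} };
    }

    sub DESTROY {
        my $self = shift;
        delete $field{$$self};
        delete $terms{$$self};
        $self->SUPER::DESTROY;
    }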

Second, once you obtain an array of terms via `$query->extract_terms()` and
bulk-fetch their stats from the remote shards, you need to cache the stats in
a hash and override `doc_freq()`.  That way, when nested query weighting
routines invoke `$searcher->doc_freq`, they get the stat which was bulk
fetched moments before.

Here's an approximation of how ClusterSearcher#doc_freq would need to look:

    sub doc_freq {
        my ( $self, %args ) = @_;
        my $field = $args{field};
        my $term  = $args{term};
        # Cache of stats bulk-fetched from the shards during weighting.
        my $cache = $self->_get_doc_freq_cache;
        if ( $cache->{$field} ) {
            my $doc_freq = $cache->{$field}{$term};
            return $doc_freq if defined $doc_freq;
        }
        # Term not seen during the bulk fetch.
        return 0;
    }
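
And here's a rough sketch of the other half -- bulk-fetching and caching --
where `_prime_doc_freq_cache()`, `_get_shards()` and the per-shard
`doc_freqs()` RPC are all hypothetical stand-ins:

    # Sketch: gather terms from the query, bulk-fetch their stats from
    # each shard in a single hypothetical doc_freqs() roundtrip, and
    # stash them where doc_freq() above can find them.
    sub _prime_doc_freq_cache {
        my ( $self, $query ) = @_;
        my @pairs = $query->extract_terms;
        my $cache = $self->_get_doc_freq_cache;
        for my $shard ( @{ $self->_get_shards } ) {
            my $counts = $shard->doc_freqs( \@pairs );
            for my $stat (@$counts) {
                $cache->{ $stat->{field} }{ $stat->{term} }
                    += $stat->{count};
            }
        }
        return;
    }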

We killed this approach off a long time ago, though.

There's a lot of dissatisfaction in Lucy-land with our labyrinthine
search-time Query weighting mechanism.  The Lucene architecture we inherited
is ridiculously convoluted and we've already been through a couple rounds of
refactoring trying to simplify it.  The last thing we want to do is make it
harder to write a custom query subclass when our users already struggle with
the complexity of that task.

Besides, bulk-fetching of term stats is only an optimization to begin with,
and a sub-optimal one compared to obtaining term stats locally.

In general, changing a public API to support an optimization should only be
done under extraordinary circumstances.  In this specific case, it's hard to
justify adding complexity to an already overcomplicated public API for the
sake of an optimization when better alternatives exist.

>> So long as the ClusterSearcher runs on the same machine as a large,
>> representative shard, using a local IndexSearcher should be a decent
>> workaround.  Scoring will be messed up if e.g. the local shard is completely
>> missing a term which is common on other shards, but at least it will be messed
>> up in the same way for all hits across all shards.
>
> I guess this would be nice to have for applications which are
> extremely performance sensitive.

Doesn't that include your use case?

I was hoping that this approach would meet your immediate needs. :\

For what it's worth, it would be easy to implement.
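
A minimal sketch, assuming a `set_stat_source()` hook on ClusterSearcher
like the one in the snippet further down (the index path is made up):

    use Lucy::Search::IndexSearcher;

    # Use a local, representative shard to answer doc_freq() lookups,
    # avoiding a network roundtrip per term.
    my $stat_source = Lucy::Search::IndexSearcher->new(
        index => '/path/to/local/shard',
    );
    $cluster_searcher->set_stat_source($stat_source);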

> Another idea would be to have the
> possibility of omitting the fetching of any statistics whatsoever, if
> there should be use cases where relevancy based on term frequencies is
> not needed.

No problem! :)

    package EmptyStatSource;
    use base qw( Lucy::Search::Searcher );

    # Pretend every term appears in exactly one document, so scoring
    # no longer depends on corpus-wide term frequencies.
    sub doc_freq {1}

    ...

    $cluster_searcher->set_stat_source(EmptyStatSource->new);

> Note, however, that assuming the solution I proposed above is
> workable, the theoretically possible speedup for using local statistics
> is no more than 2 (half the number of network roundtrips, assuming
> zero cost for everything else), at the inconvenience of increased
> infrastructural complexity and decreased accuracy of hit relevancy.

That's true, but we'll need a better mechanism eventually, anyway.

Right now, we only support sharded search, but we plan to add support for
sharded indexing at some point.  I anticipate that we'll tackle maintaining a
term stats cache at that time.

Here's a paper describing the issue:

    http://wortschatz.uni-leipzig.de/~fwitschel/papers/ipm1152.pdf

    Global Term Weights in Distributed Environments, H. Witschel, 2007

    This paper examines the estimation of global term weights (such as IDF) in
    information retrieval scenarios where a global view on the collection is
    not available.  In particular, the two options of either sampling
    documents or of using a reference corpus independent of the retrieval
    collection are compared using standard IR test collections. In addition,
    the possibility of pruning term lists based on frequency is evaluated.

Pruning low-frequency terms seems like the most appropriate approach for a
generic tool like Lucy, since we cannot guarantee that randomly sampled
documents are truly representative of a larger collection:

    The pruning approach taken here is very simple: terms with low frequency
    in the sample are pruned from the list, i.e. they are treated as if they
    had not occurred in the sample by assigning them the weight estimate for
    unseen terms.
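
Just to make the idea concrete (illustrative Perl only, nothing resembling
Lucy internals): keep only the frequent terms from the sampled shard, and
answer everything else with a single unseen-term estimate.

    # Prune low-frequency terms from a sampled doc_freq table.
    sub prune_doc_freqs {
        my ( $sampled, $min_freq ) = @_;
        return {
            map  { $_ => $sampled->{$_} }
            grep { $sampled->{$_} >= $min_freq }
            keys %$sampled
        };
    }

    my $sampled = { the => 9500, lucy => 120, zyzzyva => 1 };
    my $pruned  = prune_doc_freqs( $sampled, 10 );

    # Pruned or never-seen terms fall back to the unseen-term estimate.
    my $unseen_estimate = 1;
    my $df = exists $pruned->{zyzzyva}
        ? $pruned->{zyzzyva}
        : $unseen_estimate;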

However, sampling works well most of the time and the sample doesn't even need
to be that large -- so the approach of using a single shard to supply stats
should work pretty well as long as the shard is representative.

    This tells us that what is really needed for good retrieval performance is
    just a small list of very frequent terms (one could call it an “extended
    list of stop words”).

    ...

    Sampling is generally attractive in terms of effectiveness: in most cases,
    a sample of less than 5% of the collection was sufficient for close to
    optimal performance.

Marvin Humphrey
