lucy-user mailing list archives

From Dag Lem
Subject Re: [lucy-user] ClusterSearcher statistics
Date Fri, 26 Oct 2012 09:35:44 GMT
Marvin Humphrey writes:


> First, regarding term extraction, it does not suffice to walk the query tree
> looking for TermQueries -- PhraseQueries also have terms, but more crucially,
> so do arbitrary user-defined Query subclasses.  In order to get at terms
> within arbitrary Query objects, Query needs an `extract_terms()` method which
> subclasses may have to override.
>
> Second, once you obtain an array of terms via `$query->extract_terms()` and
> bulk-fetch their stats from the remote shards, you need to cache the stats in
> a hash and override `doc_freq()`.  That way, when nested query weighting
> routines invoke `$searcher->doc_freq`, they get the stat which was bulk
> fetched moments before.


> There's a lot of dissatisfaction in Lucy-land with our labyrinthine
> search-time Query weighting mechanism.  The Lucene architecture we inherited
> is ridiculously convoluted and we've already been through a couple rounds of
> refactoring trying to simplify it.  The last thing we want to do is make it
> harder to write a custom query subclass when our users already struggle with
> the complexity of that task.

OK, so how about this poor man's solution?

1. Add a private function to Searcher to switch between three
   different behaviors of doc_freq() - normal operation, store
   field/term in cache, or retrieve freq from cache.

2. For ClusterSearcher, insert an extra call to QueryParser::Parse to
   store field/term in the cache (discarding the returned query), and
   call the new function doc_freqs() to add the freqs to the cache.
   Then, let the existing call to QueryParser::Parse retrieve from the
   cache and build the actual query.

Sure, it's a hack, but as far as I can tell it would not be very
intrusive, nor would it change the public API.
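
As a rough illustration of the two steps above, here is a toy model of
the record / bulk-fetch / replay idea. It is written in Python purely
for illustration and does not use Lucy's API: ToySearcher, Mode,
fill_cache() and weigh() are made-up names, the shards are plain
dictionaries, and weigh() merely stands in for whatever the
QueryParser call does when it asks the searcher for doc_freq().

```python
from enum import Enum


class Mode(Enum):
    NORMAL = 0  # doc_freq() asks the shards directly
    RECORD = 1  # doc_freq() only notes which (field, term) was requested
    REPLAY = 2  # doc_freq() answers from the cache


class ToySearcher:
    """Hypothetical stand-in for a cluster searcher; not Lucy's API."""

    def __init__(self, shards):
        # Each shard is modelled as a dict: (field, term) -> doc freq.
        self.shards = shards
        self.mode = Mode.NORMAL
        self.cache = {}
        self.round_trips = 0  # counts simulated requests to shards

    def _remote_doc_freqs(self, keys):
        # One simulated request per shard, covering all keys at once.
        totals = dict.fromkeys(keys, 0)
        for shard in self.shards:
            self.round_trips += 1
            for key in keys:
                totals[key] += shard.get(key, 0)
        return totals

    def doc_freq(self, field, term):
        key = (field, term)
        if self.mode is Mode.RECORD:
            self.cache[key] = None
            return 1  # placeholder; the query built in this pass is discarded
        if self.mode is Mode.REPLAY:
            return self.cache[key]  # KeyError if the dry run missed this term
        return self._remote_doc_freqs([key])[key]

    def fill_cache(self):
        # Bulk-fetch the stats for every key noted during the dry run.
        self.cache = self._remote_doc_freqs(list(self.cache))


def weigh(searcher, terms):
    # Stand-in for query building: looks up the doc freq of each term.
    return {term: searcher.doc_freq("content", term) for term in terms}


def search(searcher, terms):
    searcher.mode = Mode.RECORD
    weigh(searcher, terms)      # dry run, result thrown away
    searcher.fill_cache()       # one bulk request per shard
    searcher.mode = Mode.REPLAY
    try:
        return weigh(searcher, terms)  # real pass, served from the cache
    finally:
        searcher.mode = Mode.NORMAL
        searcher.cache = {}
```

In this model the number of simulated shard requests per search equals
the number of shards, regardless of how many terms the query contains,
whereas NORMAL mode costs one request per shard for every term.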

> Besides, bulk-fetching of term stats is only an optimization to begin with,
> and it's a sub-optimal optimization in comparison to the approach of obtaining
> term stats locally.

That depends. IMHO the advantages of a fully distributed solution can
in many cases handily trump the theoretical (and far from achievable
in practice) 2x performance win of a local statistics
database. E.g. if I envision, some time in the future, *several*
clients querying the same massive sharded Lucy index, it smells like
unwanted complexity to have to maintain a local index for each client.

Sure, if you have a single client where performance is paramount, and
adding more shards is not practical, then local statistics would be
very nice.

I'd say as Winnie-the-Pooh: Both! :-)


> > I guess this would be nice to have for applications which are
> > extremely performance sensitive.
> Doesn't that include your use case?

Not at all, really :) I've only been doing some tests on Lucy to see
whether it could be used in a possible future project. This would
cover a batch-oriented system without any hard limits on performance.
I simply wanted to see just how fast things could run (faster is
always better), tested SearchServer / ClusterSearcher, and you know
the rest :-)

> I was hoping that this approach would meet your immediate needs. :\

Rest assured that Lucy would without a doubt cover my needs, if the
project should materialize! :-)

> No problem! :)
>
>     package EmptyStatSource;
>     use base qw( Lucy::Search::Searcher );
>     sub doc_freq {1}

Nice :-)

Best regards,

Dag Lem
