lucene-solr-dev mailing list archives

From Mike Klaas <mike.kl...@gmail.com>
Subject Re: Intuition check
Date Thu, 08 Nov 2007 20:16:29 GMT
On 8-Nov-07, at 8:59 AM, Chris Hostetter wrote:

> Let's back up a second...
>
> the theory is that while it's frequently handy to cache fq's independent
> of the main query (because they are probably used over and over), in some
> cases it may be advantageous to use an FQ directly in the body of the
> main query to get better skipTo behavior. -- the fundamental issue is
> orthogonal to whether or not a DocSet for the FQ is cached, the question
> is how should that FQ be used when computing the final DocList.
>
> So what if instead of letting the client say "this argument is an fq which
> should be used to generate a BitSet and cached as a filter, this argument
> is an fq.nocache which should be added to the main query" we instead make
> SolrIndexSearcher smart enough to say "i've been asked to filter the
> main query using some DocSets, the intersection of those DocSets is small
> enough, that instead of filtering the query on it, i'm going to add a
> query that only matches docs in it to the main query to improve skipTo
> behavior." ... so now clients don't have to know, they just pass in a
> bunch of fq params.  we still cache a DocSet for each one, but when it
> comes time to do the search, we get the skipTo benefit anytime the
> intersection of all fqs is really small (whether the individual fqs are
> small enough individually or not)

I agree that this would be awesome if it can be pulled off.

> that should just be a simple change to getDocListNC right?

Let's think about this: To effectively do what you suggest, the query  
handling needs to

1. determine whether a given (set of) filter(s) would be effective in  
a skipTo context
2. embed the filter in the query as a scorer

I see difficulties with both, but perhaps they are not insurmountable.

First, how do we determine whether embedding the filter would be
effective?  We have at our disposal the size of the filter
intersection, assuming the filters are cached.  The most important
criterion here is probably the relative size of the result set with
and without the filter applied, which isn't really available.  It can
be estimated by assuming the filter and query are independent, but
that definitely isn't always true.  And what if the filter isn't, or
shouldn't be, cached?  Then you have to compute it separately just to
make this decision, and avoiding that computation is part of the goal.
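
As a rough illustration of the independence estimate: if the query and
the filter match documents independently, the fraction of query hits
that survive the filter is approximately filterSize / maxDoc.  The
sketch below is plain Java, not Solr code; the class name, the method
names, and the 0.05 threshold are all assumptions invented for the
example, and a real cutoff would have to be measured.

```java
/** Illustrative only: not part of Solr or Lucene. */
final class FilterEmbedHeuristic {

    /** Expected hit count if query and filter match documents independently. */
    static double estimateFilteredHits(long queryHits, long filterSize, long maxDoc) {
        if (maxDoc <= 0) {
            return 0.0;
        }
        return (double) queryHits * filterSize / maxDoc;
    }

    /**
     * Hypothetical decision rule: embed the filter in the query when the
     * cached filter intersection keeps at most maxFraction of the index.
     * Under independence this fraction is also the estimated ratio of
     * filtered hits to unfiltered hits.
     */
    static boolean shouldEmbed(long filterSize, long maxDoc, double maxFraction) {
        return maxDoc > 0 && (double) filterSize / maxDoc <= maxFraction;
    }

    public static void main(String[] args) {
        long maxDoc = 1_000_000L;
        long filterSize = 20_000L;   // size of the cached filter intersection
        long queryHits = 1_000L;     // assumed known here; usually it is not
        System.out.println(estimateFilteredHits(queryHits, filterSize, maxDoc));
        System.out.println(shouldEmbed(filterSize, maxDoc, 0.05));
    }
}
```

Note that the rule only needs the filter size and maxDoc, which is
what makes it usable before the query has run; it is also exactly
where a correlated filter and query would make the estimate wrong.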

Second, embedding the filter itself.  This is much more nettlesome
within SolrIndexSearcher than within one of the request handlers.
One problem is the use of BooleanScorer--I suppose we could detect
that by walking the query tree looking for it.  Another is the
embedding location: if filters are embedded in SIS, then the only
reasonable option is to wrap everything in another top-level
BooleanScorer with the original and filter query as required clauses
(perhaps the filter would be inserted as prohibited if the inverse
bitset was sufficiently sparse).  This means that the next()'s that
happen to occur on the original query will pull in lots of extra
scoring that might not be needed: bq's, bf's, pf's, whatever else is
layered on the scoring (in my case, there are 1-2 layers of
multiplicative boosts as well).  It would be nicer to insert the
filters directly into the "matching" part of the query.

Actually, never mind: ReqOptSumScorer does not pull ahead its optional
scorers until .score() is called, so the effects should be largely
the same.
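
For readers less familiar with the skipTo benefit under discussion,
here is a minimal sketch of intersecting a query's matches with a
small filter as two required clauses.  It is plain Java over sorted
int arrays, not Lucene's scorer code; SortedDocs, advance(), and
intersect() are hypothetical names chosen for the example.  Each side
jumps straight to the other side's current doc instead of stepping
through every match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: not part of Solr or Lucene. */
final class SkipToIntersection {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Minimal doc-id iterator over a sorted array of unique doc ids. */
    static final class SortedDocs {
        private final int[] docs;
        private int pos = 0;

        SortedDocs(int[] sortedDocs) {
            this.docs = sortedDocs;
        }

        /** Moves to the first doc >= target (never backwards) and returns it. */
        int advance(int target) {
            int idx = Arrays.binarySearch(docs, pos, docs.length, target);
            pos = idx >= 0 ? idx : -idx - 1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
    }

    /** Leapfrog conjunction: both inputs are treated as required clauses. */
    static List<Integer> intersect(SortedDocs query, SortedDocs filter) {
        List<Integer> hits = new ArrayList<>();
        int q = query.advance(0);
        int f = filter.advance(0);
        while (q != NO_MORE_DOCS && f != NO_MORE_DOCS) {
            if (q == f) {
                hits.add(q);
                q = query.advance(q + 1);
                f = filter.advance(f + 1);
            } else if (q < f) {
                q = query.advance(f);   // skip query docs the filter rules out
            } else {
                f = filter.advance(q);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SortedDocs query = new SortedDocs(new int[] {1, 3, 5, 7, 9, 11, 500, 900});
        SortedDocs filter = new SortedDocs(new int[] {7, 900});
        System.out.println(intersect(query, filter));
    }
}
```

The smaller the filter intersection, the more of the query's matches
are skipped without being visited, which is why the size of that
intersection is the natural input to the embedding decision.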

ISTM then that the main challenge is in determining when the filter
intersection should be embedded.  Also, controlling filter caching
is still difficult with this approach, but perhaps that's less
important.

Thanks for the feedback,
-Mike
