lucene-java-user mailing list archives

From Renaud Delbru <renaud.del...@deri.org>
Subject Re: Sorting posting lists before intersection
Date Mon, 13 Oct 2008 15:46:28 GMT
Andrzej Bialecki wrote:
> Renaud Delbru wrote:
>> Hi Andrzej,
>>
>> sorry for the late reply.
>>
>> I have looked at the code. As far as I understand, you sort the
>> posting lists based on the first doc skip. The first posting list
>> will be the one with the largest first document skip.
>> Is the sparseness of posting lists a good predictor for sampling
>> and ordering posting lists? Do you know of any evaluation of such a
>> technique?
>
> It is _some_ predictor ... :) whether it's a good one is another 
> question. It's certainly very inexpensive - we don't do any additional 
> IO except what we have to do anyway, which is scorer.skipTo().
>
> In the general case it's costly to calculate the frequency (or
> sparseness) of matches in a scorer without actually running the
> scorer through all its matches.
You can estimate the frequency for some scorers, such as
ConjunctionScorer, DisjunctionSumScorer, etc., as Paul Elschot explained
in the other reply.
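
A minimal sketch of how such an estimate could be derived without running the scorer, assuming only simple upper bounds computed from per-clause match counts (for a term clause, its docFreq). The class MatchEstimates and its methods are hypothetical names, not Lucene API, and this is not necessarily the approach described in the other reply.

```java
import java.util.List;

/** Hypothetical helper for illustration; not part of Lucene. */
final class MatchEstimates {
    private MatchEstimates() {}

    /** A conjunction cannot match more documents than its rarest clause. */
    static long conjunctionUpperBound(List<Long> clauseEstimates) {
        if (clauseEstimates.isEmpty()) {
            return 0;
        }
        long bound = Long.MAX_VALUE;
        for (long estimate : clauseEstimates) {
            bound = Math.min(bound, estimate);
        }
        return bound;
    }

    /**
     * A disjunction cannot match more documents than the sum of its
     * clauses, and never more than the number of documents in the index.
     */
    static long disjunctionUpperBound(List<Long> clauseEstimates, long maxDoc) {
        long sum = 0;
        for (long estimate : clauseEstimates) {
            sum += estimate;
        }
        return Math.min(sum, maxDoc);
    }
}
```

These are bounds rather than true frequencies, so they can only rank clauses roughly; nested queries can apply the same two rules recursively.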
> Answering your question: docFreq call uses TermInfo information, which 
> uses a small RAM cache. If you're lucky then it won't cause any IO, 
> otherwise it needs to read this info from the .ti file.
Thanks for the clarification.
If we assume that a query is composed of only a few terms, this will
require, in the worst case, one IO access per term. I think the cost of
the additional IO access can be offset by the better prediction that
the frequency gives. This is something to benchmark / evaluate.

Regards
-- 
Renaud Delbru
