lucene-java-user mailing list archives

From Andrzej Bialecki
Subject Re: Sorting posting lists before intersection
Date Mon, 13 Oct 2008 15:00:06 GMT
Renaud Delbru wrote:
> Hi Andrzej,
> sorry for the late reply.
> I have looked at the code. As far as I understand, you sort the posting
> lists based on the first doc skip. The first posting list will be the
> one with the largest first document skip.
> Is the sparseness of posting lists a good predictor for sampling and
> ordering posting lists? Do you know of any evaluation of such a technique?

It is _some_ predictor ... :) whether it's a good one is another 
question. It's certainly very inexpensive - we don't do any additional 
IO except what we have to do anyway, which is scorer.skipTo().
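
To illustrate the idea, here is a minimal sketch, not the actual Lucene code. Posting lists are modelled as sorted int arrays, and the hypothetical PostingList.skipTo() stands in for scorer.skipTo(). Each list is advanced once, and the lists are then ordered so that the one landing on the largest doc id (the sparsest, by this rough estimate) leads the intersection. All class and method names below are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FirstSkipOrdering {

    /** Hypothetical stand-in for a scorer: a cursor over a sorted array of doc ids. */
    static final class PostingList {
        static final int NO_MORE_DOCS = Integer.MAX_VALUE;
        final String name;
        private final int[] docs;
        private int pos = -1; // -1 means the cursor has not been positioned yet

        PostingList(String name, int... docs) {
            this.name = name;
            this.docs = docs;
        }

        /** Current doc id, -1 before the first skipTo call, NO_MORE_DOCS once exhausted. */
        int doc() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        /** Moves to the first doc id >= target (never backwards) and returns it. */
        int skipTo(int target) {
            if (pos < 0) pos = 0;
            while (pos < docs.length && docs[pos] < target) pos++;
            return doc();
        }
    }

    /** Advances every list once, then puts the list that landed furthest first. */
    static List<PostingList> orderByFirstSkip(List<PostingList> lists) {
        for (PostingList p : lists) p.skipTo(0);
        List<PostingList> ordered = new ArrayList<>(lists);
        ordered.sort((a, b) -> Integer.compare(b.doc(), a.doc()));
        return ordered;
    }

    /** Leapfrog intersection driven by the first list in the given order. */
    static List<Integer> intersect(List<PostingList> ordered) {
        List<Integer> hits = new ArrayList<>();
        if (ordered.isEmpty()) return hits;
        int target = 0;
        while (true) {
            int candidate = ordered.get(0).skipTo(target);
            if (candidate == PostingList.NO_MORE_DOCS) return hits;
            boolean allMatch = true;
            for (int i = 1; i < ordered.size(); i++) {
                int d = ordered.get(i).skipTo(candidate);
                if (d != candidate) { // overshoot: restart from the lead list at d
                    target = d;
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                hits.add(candidate);
                target = candidate + 1;
            }
        }
    }

    public static void main(String[] args) {
        List<PostingList> lists = Arrays.asList(
                new PostingList("a", 1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                new PostingList("b", 5, 10, 50),
                new PostingList("c", 2, 5, 9, 10, 11));
        List<PostingList> ordered = orderByFirstSkip(lists);
        for (PostingList p : ordered) System.out.println(p.name + " first doc: " + p.doc());
        System.out.println("matches: " + intersect(ordered));
    }
}
```

With the sample lists in main(), list b lands on doc 5 after its first skip and therefore leads, followed by c and then a. The ordering costs nothing beyond the one skipTo() per list that the intersection needs anyway.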

In the general case it's costly to calculate the frequency (or sparseness)
of matches in a scorer without actually running the scorer through all
its matches.

> In order to implement sorting based on frequency, we need the document
> frequency of each term. This information should be propagated through
> the Scorer classes (from TermScorer to higher-level classes such as
> ConjunctiveScorer). This will require a call to
> IndexReader.docFreq(term) for each of the term queries. Does a docFreq
> call mean another IO access?

It sounds like you plan to order scorers by term frequency ... but in
the general case they won't all be TermScorers, so the frequency of
documents matching a scorer won't have any particular connection to a
single term's frequency.

Answering your question: a docFreq call uses TermInfo information, which
is served from a small RAM cache. If you're lucky it won't cause any IO;
otherwise it needs to read this info from the .ti file.
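
As a rough sketch of what frequency-based ordering could look like when every clause really is a plain term: the hypothetical CachedDocFreq class below stands in for IndexReader.docFreq(term) sitting behind a small LRU cache, and its slowLookup function plays the role of the on-disk read. None of this is Lucene's actual cache implementation; the names, the cache policy and the capacity are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

final class DocFreqOrdering {

    /** Hypothetical LRU cache in front of a slow lookup; counts how often the slow path runs. */
    static final class CachedDocFreq {
        private final ToIntFunction<String> slowLookup; // stands in for reading term info from disk
        private final Map<String, Integer> cache;
        int misses = 0;

        CachedDocFreq(ToIntFunction<String> slowLookup, final int capacity) {
            this.slowLookup = slowLookup;
            // An access-ordered LinkedHashMap evicts the least recently used entry.
            this.cache = new LinkedHashMap<String, Integer>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                    return size() > capacity;
                }
            };
        }

        /** Returns the document frequency, touching the slow lookup only on a cache miss. */
        int docFreq(String term) {
            Integer cached = cache.get(term);
            if (cached != null) return cached;
            misses++;
            int df = slowLookup.applyAsInt(term);
            cache.put(term, df);
            return df;
        }
    }

    /** Returns the terms sorted so that the rarest term (smallest docFreq) comes first. */
    static List<String> orderByDocFreq(List<String> terms, CachedDocFreq stats) {
        List<String> ordered = new ArrayList<>(terms);
        ordered.sort((a, b) -> Integer.compare(stats.docFreq(a), stats.docFreq(b)));
        return ordered;
    }
}
```

In this model, ordering the same terms a second time is answered entirely from the cache as long as the capacity covers them, which is the "lucky" case. A term that has been evicted costs one more slow lookup.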

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
