lucene-java-user mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Sorting posting lists before intersection
Date Mon, 13 Oct 2008 15:21:06 GMT
On Monday 13 October 2008 17:00:06, Andrzej Bialecki wrote:
> Renaud Delbru wrote:
> > Hi Andrzej,
> >
> > sorry for the late reply.
> >
> > I have looked at the code. As far as I understand, you sort the
> > posting lists based on the first doc skip. The first posting list
> > will be the one with the largest first document skip.
> > Is the sparseness of posting lists a good predictor for sampling
> > and ordering posting lists? Do you know of any evaluation of such
> > a technique?
>
> It is _some_ predictor ... :) whether it's a good one is another
> question. It's certainly very inexpensive - we don't do any
> additional IO except what we have to do anyway, which is
> scorer.skipTo().
>
> In the general case it's costly to calculate the frequency (or
> sparseness) of matches in a scorer without actually running the
> scorer through all its matches.
>
> > In order to implement sorting based on frequency, we need the
> > document frequency of each term. This information should be
> > propagated through the Scorer classes (from TermScorer to higher
> > level classes such as ConjunctionScorer). This will require a call
> > to IndexReader.docFreq(term) for each of the term queries. Does a
> > docFreq call mean another IO access?
>
> It sounds like you plan to order scorers by term frequency ... but in
> the general case they won't all be TermScorers, so the frequency of
> documents matching a scorer won't have any particular connection to
> a single term freq.

This could be done, but since not all scorers will be TermScorers it
will be necessary to add a method to Scorer (or perhaps even to its
DocIdSetIterator superclass):

   public abstract int estimatedDocFreq();

and implement this for all existing subclasses. TermScorer could
implement it exactly, without estimating.
For AND/OR/NOT such an estimate is straightforward, but for
proximity queries it would be more of a guess.
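
To make the idea concrete, here is a rough sketch of how such estimates
could compose. It is not Lucene code: SketchScorer, TermSketch,
AndSketch, OrSketch, ReqExclSketch and ConjunctionOrder are hypothetical
names made up for illustration, and the rules used (minimum for AND,
capped sum for OR, the required clause for NOT) are just one plausible
choice of upper bounds.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for Scorer; this is NOT the Lucene class.
abstract class SketchScorer {
    /** Upper-bound estimate of the number of documents this scorer can match. */
    public abstract int estimatedDocFreq();
}

// Term scorer: the document frequency is known exactly, so nothing is estimated.
final class TermSketch extends SketchScorer {
    private final int docFreq;
    TermSketch(int docFreq) { this.docFreq = docFreq; }
    @Override public int estimatedDocFreq() { return docFreq; }
}

// AND: cannot match more documents than its rarest clause (assumes >= 1 clause).
final class AndSketch extends SketchScorer {
    private final List<SketchScorer> clauses;
    AndSketch(List<SketchScorer> clauses) { this.clauses = clauses; }
    @Override public int estimatedDocFreq() {
        int min = Integer.MAX_VALUE;
        for (SketchScorer s : clauses) {
            min = Math.min(min, s.estimatedDocFreq());
        }
        return min;
    }
}

// OR: at most the sum of its clauses, and never more than the index size.
final class OrSketch extends SketchScorer {
    private final List<SketchScorer> clauses;
    private final int maxDoc;
    OrSketch(List<SketchScorer> clauses, int maxDoc) {
        this.clauses = clauses;
        this.maxDoc = maxDoc;
    }
    @Override public int estimatedDocFreq() {
        long sum = 0; // long avoids int overflow while summing
        for (SketchScorer s : clauses) {
            sum += s.estimatedDocFreq();
        }
        return (int) Math.min(sum, maxDoc);
    }
}

// NOT, in the "required AND NOT excluded" form: bounded by the required clause.
final class ReqExclSketch extends SketchScorer {
    private final SketchScorer required;
    ReqExclSketch(SketchScorer required, SketchScorer excluded) {
        // The excluded clause is ignored here: it can only lower the true count.
        this.required = required;
    }
    @Override public int estimatedDocFreq() { return required.estimatedDocFreq(); }
}

// Orders conjunction clauses so the (estimated) sparsest one comes first.
final class ConjunctionOrder {
    static List<SketchScorer> sparsestFirst(List<SketchScorer> clauses) {
        List<SketchScorer> sorted = new ArrayList<>(clauses);
        sorted.sort(Comparator.comparingInt(SketchScorer::estimatedDocFreq));
        return sorted;
    }
}
```

A conjunction could then call ConjunctionOrder.sparsestFirst(...) once
before it starts skipping, so that the scorer expected to match the
fewest documents drives the intersection. A proximity scorer would have
no comparably simple rule; falling back to the AND bound over its terms
is one possible guess.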

Regards,
Paul Elschot
