lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Li Li <fancye...@gmail.com>
Subject Re: why the of advance(int target) function of DocIdSetIterator is defined with uncertain?
Date Tue, 17 Apr 2012 10:42:44 GMT
some mistakes of the example:
after first call advance(5)
    currentDoc=6
    first scorer's nextDoc is called to in advance, the heap is empty now.
then call advance(6)
    because scorerDocQueue.size() < minimumNrMatchers, it just return
NO_MORE_DOCS

On Tue, Apr 17, 2012 at 6:37 PM, Li Li <fancyerii@gmail.com> wrote:

> hi all,
>     I am now hacking the BooleanScorer2 to let it keep the docID() of the
> leaf scorer(mostly possible TermScorer) the same as the top-level Scorer.
> Why I want to do this is: When I Collect a doc, I want to know which term
> is matched(especially for BooleanClause whose Occur is SHOULD). we have
> discussed some solutions, such as adding bit masks in disjunction scorers.
> with this method, when we finds a matched doc, we can recursively find
> which leaf scorer is matched. But we think it's not very efficient and not
> convenient to use(this is my proposal but not agreed by others in our
> team). and then we came up with another one: Modifying DisjunctionSumScorer.
>    we analysed the codes and found that the only Scorers used by
> BooleanScorer2 that will make the children scorers' docID() not equal to
> parent is an anonymous class inherited from DisjunctionSumScorer. All other
> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous),
> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need.
>    The implementation algorithm of DisjunctionSumScorer use a heap to find
> the smallest doc. after finding a matched doc, the currentDoc is the
> matched doc and all the scorers in the heap will call nextDoc() so all of
> the scorers' current docID the nextDoc of currentDoc. if there are N level
> DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId
> of the root of the scorer tree.
>    So we modify the DisjuctionSumScorer and let it behavior as we
> expected. And then I wrote some TestCase and it works well. And also I
> wrote some random generated TermScorer and compared the nextDoc(),score()
> and advance(int) method of original DisjunctionSumScorer and modified one.
> nextDoc() and score() and exactly the same. But for advance(int target), we
> found some interesting and strange things.
>    at the beginning, I think if target is less than current docID, it will
> just return current docID and do nothing. this assumption let my algorithm
> go wrong. Then I read the codes of TermScorer and found each call of
> advance(int) of TermScorer will call nextDoc() no matter whether current
> docID is larger than target or not.
>    So I am confused and then read the javadoc of DocIdSetIterator:
> ----------------- javadoc of DocIdSetIterator.advance(int
> target)-------------
>
> int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws
> IOException
>
> Advances to the first beyond (see NOTE below) the current whose document
> number is greater than or equal
>  to target. Returns the current document number or NO_MORE_DOCS if there
> are no more docs in the set.
> Behaves as if written:
>  int advance(int target) {
>    int doc;
>    while ((doc = nextDoc()) < target) {
>    }
>    return doc;
>  }
>  Some implementations are considerably more efficient than that.
> NOTE: when target < current implementations may opt not to advance beyond
> their current docID().
> NOTE: this method may be called with NO_MORE_DOCS for efficiency by some
> Scorers. If your
>  implementation cannot efficiently determine that it should exhaust, it is
> recommended that you check for
>  that value in each call to this method.
> NOTE: after the iterator has exhausted you should not call this method, as
> it may result in unpredicted
>  behavior.
> --------------------------------------
> Then I modified my algorithm again and found that
> DisjunctionSumScorer.advance(int target) has some strange behavior. most of
> the cases, it will return currentDoc if target < currentDoc. but in some
> boundary condition, it will not.
> it's not a bug but let me sad. I thought my algorithm has some bug because
> it's advance method is not exactly the same as original
> DisjunctionSumScorer's.
> ----codes of DisjunctionSumScorer---
>   @Override
>   public int advance(int target) throws IOException {
>     if (scorerDocQueue.size() < minimumNrMatchers) {
>       return currentDoc = NO_MORE_DOCS;
>     }
>     if (target <= currentDoc) {
>       return currentDoc;
>     }
>    ....
> -------------------
> for most case if (target <= currentDoc) it will return currentDoc;
> but if previous advance will make sub scorers exhausted, then if may
> return NO_MORE_DOCS
> an example is:
>    currentDoc=-1
>    minimumNrMatchers=1
>    subScorers:
>       TermScorer: docIds: [1, 2, 6]
>       TermScorer: docIds: [2, 4]
> after first call advance(5)
>     currentDoc=6
>     only first scorer is now in the heap, scorerDocQueue.size()==1
> then call advance(6)
>     because scorerDocQueue.size() < minimumNrMatchers, it just return
> NO_MORE_DOCS
>
> My question is why the advance(int target) method is defined like this?
> for the reason of efficient or any other reasons?
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message