lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Li Li <fancye...@gmail.com>
Subject Re: why the of advance(int target) function of DocIdSetIterator is defined with uncertain?
Date Thu, 19 Apr 2012 03:34:27 GMT
Michael McCandless wrote:

So... the good news is I made a new scorer (basically copied
DisjunctionMaxScorer and then tweaked from there) that scores the OR-only
case. All tests pass w/ this new scorer.

And more good news is that if you don't score (I sort by doctitle to do
that), you get a speedup – 7.7% in my simplistic test (prefix query unit*,
expands to 988 terms, but I force it to do a scoring BQ rewrite, plus force
it to use BS2 not BS – the nocommits in the patch).

But the bad news is with scoring on it's 22.7% slower!

And, the weird news is, I discovered accidentally that BS2 is much (> 2X)
faster for this one query. I think we need to modify the criteria that
decides whether to use BS or BS2... maybe when there are lots of
lowish-docFreq terms, BS2 is better?
---------------
why new algorithm is 22% slower than old one?
I read the codes and feel them almost the same.

On Tue, Apr 17, 2012 at 8:16 PM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Hello,
>
> I can't help with the particular question, but can share some experience.
> My task is roughly the same I've found the patch
> https://issues.apache.org/jira/browse/LUCENE-2686 is absolutely useful
> (with one small addition, I'll post it in comments soon). By using it I
> have disjunction summing query with steady subscorers.
>
> Regards
>
> On Tue, Apr 17, 2012 at 2:37 PM, Li Li <fancyerii@gmail.com> wrote:
>
>> hi all,
>>     I am now hacking the BooleanScorer2 to let it keep the docID() of the
>> leaf scorer(mostly possible TermScorer) the same as the top-level Scorer.
>> Why I want to do this is: When I Collect a doc, I want to know which term
>> is matched(especially for BooleanClause whose Occur is SHOULD). we have
>> discussed some solutions, such as adding bit masks in disjunction scorers.
>> with this method, when we finds a matched doc, we can recursively find
>> which leaf scorer is matched. But we think it's not very efficient and not
>> convenient to use(this is my proposal but not agreed by others in our
>> team). and then we came up with another one: Modifying DisjunctionSumScorer.
>>    we analysed the codes and found that the only Scorers used by
>> BooleanScorer2 that will make the children scorers' docID() not equal to
>> parent is an anonymous class inherited from DisjunctionSumScorer. All other
>> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous),
>> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need.
>>    The implementation algorithm of DisjunctionSumScorer use a heap to
>> find the smallest doc. after finding a matched doc, the currentDoc is the
>> matched doc and all the scorers in the heap will call nextDoc() so all of
>> the scorers' current docID the nextDoc of currentDoc. if there are N level
>> DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId
>> of the root of the scorer tree.
>>    So we modify the DisjuctionSumScorer and let it behavior as we
>> expected. And then I wrote some TestCase and it works well. And also I
>> wrote some random generated TermScorer and compared the nextDoc(),score()
>> and advance(int) method of original DisjunctionSumScorer and modified one.
>> nextDoc() and score() and exactly the same. But for advance(int target), we
>> found some interesting and strange things.
>>    at the beginning, I think if target is less than current docID, it
>> will just return current docID and do nothing. this assumption let my
>> algorithm go wrong. Then I read the codes of TermScorer and found each call
>> of advance(int) of TermScorer will call nextDoc() no matter whether current
>> docID is larger than target or not.
>>    So I am confused and then read the javadoc of DocIdSetIterator:
>> ----------------- javadoc of DocIdSetIterator.advance(int
>> target)-------------
>>
>> int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws
>> IOException
>>
>> Advances to the first beyond (see NOTE below) the current whose document
>> number is greater than or equal
>>  to target. Returns the current document number or NO_MORE_DOCS if there
>> are no more docs in the set.
>> Behaves as if written:
>>  int advance(int target) {
>>    int doc;
>>    while ((doc = nextDoc()) < target) {
>>    }
>>    return doc;
>>  }
>>  Some implementations are considerably more efficient than that.
>> NOTE: when target < current implementations may opt not to advance beyond
>> their current docID().
>> NOTE: this method may be called with NO_MORE_DOCS for efficiency by some
>> Scorers. If your
>>  implementation cannot efficiently determine that it should exhaust, it
>> is recommended that you check for
>>  that value in each call to this method.
>> NOTE: after the iterator has exhausted you should not call this method,
>> as it may result in unpredicted
>>  behavior.
>> --------------------------------------
>> Then I modified my algorithm again and found that
>> DisjunctionSumScorer.advance(int target) has some strange behavior. most of
>> the cases, it will return currentDoc if target < currentDoc. but in some
>> boundary condition, it will not.
>> it's not a bug but let me sad. I thought my algorithm has some bug
>> because it's advance method is not exactly the same as original
>> DisjunctionSumScorer's.
>> ----codes of DisjunctionSumScorer---
>>   @Override
>>   public int advance(int target) throws IOException {
>>     if (scorerDocQueue.size() < minimumNrMatchers) {
>>       return currentDoc = NO_MORE_DOCS;
>>     }
>>     if (target <= currentDoc) {
>>       return currentDoc;
>>     }
>>    ....
>> -------------------
>> for most case if (target <= currentDoc) it will return currentDoc;
>> but if previous advance will make sub scorers exhausted, then if may
>> return NO_MORE_DOCS
>> an example is:
>>    currentDoc=-1
>>    minimumNrMatchers=1
>>    subScorers:
>>       TermScorer: docIds: [1, 2, 6]
>>       TermScorer: docIds: [2, 4]
>> after first call advance(5)
>>     currentDoc=6
>>     only first scorer is now in the heap, scorerDocQueue.size()==1
>> then call advance(6)
>>     because scorerDocQueue.size() < minimumNrMatchers, it just return
>> NO_MORE_DOCS
>>
>> My question is why the advance(int target) method is defined like this?
>> for the reason of efficient or any other reasons?
>>
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> gedel@yandex.ru
>
> <http://www.griddynamics.com>
>  <mkhludnev@griddynamics.com>
>
>

Mime
View raw message