Li,
I can give you my understanding of difference in DisjunctionSumScorer and
https://issues.apache.org/jira/browse/LUCENE2686
When you choose steady children approach, even you omit call score() you
have to enumerate top of child scorers heap twice.
Check nextDoc() from the patch: first loop pushes top of heap first, then
the second loop through the top is done by
countMatches(1); countMatches(2);
Current DisjunctionSumScorer enumerate top of heap once per every doc:
while it loops through top of heap it counts number of clauses matched;
accumulate score; and pushes top children one step forward.
what's minimumNrMatchers do you have? can you upload your test github?
(mail list rips attachments off)
On Thu, Apr 19, 2012 at 7:34 AM, Li Li <fancyerii@gmail.com> wrote:
> Michael McCandless wrote:
>
> So... the good news is I made a new scorer (basically copied
> DisjunctionMaxScorer and then tweaked from there) that scores the ORonly
> case. All tests pass w/ this new scorer.
>
> And more good news is that if you don't score (I sort by doctitle to do
> that), you get a speedup – 7.7% in my simplistic test (prefix query unit*,
> expands to 988 terms, but I force it to do a scoring BQ rewrite, plus force
> it to use BS2 not BS – the nocommits in the patch).
>
> But the bad news is with scoring on it's 22.7% slower!
>
> And, the weird news is, I discovered accidentally that BS2 is much (> 2X)
> faster for this one query. I think we need to modify the criteria that
> decides whether to use BS or BS2... maybe when there are lots of
> lowishdocFreq terms, BS2 is better?
> 
> why new algorithm is 22% slower than old one?
> I read the codes and feel them almost the same.
>
> On Tue, Apr 17, 2012 at 8:16 PM, Mikhail Khludnev <
> mkhludnev@griddynamics.com> wrote:
>
>> Hello,
>>
>> I can't help with the particular question, but can share some experience.
>> My task is roughly the same I've found the patch
>> https://issues.apache.org/jira/browse/LUCENE2686 is absolutely useful
>> (with one small addition, I'll post it in comments soon). By using it I
>> have disjunction summing query with steady subscorers.
>>
>> Regards
>>
>> On Tue, Apr 17, 2012 at 2:37 PM, Li Li <fancyerii@gmail.com> wrote:
>>
>>> hi all,
>>> I am now hacking the BooleanScorer2 to let it keep the docID() of
>>> the leaf scorer(mostly possible TermScorer) the same as the toplevel
>>> Scorer. Why I want to do this is: When I Collect a doc, I want to know
>>> which term is matched(especially for BooleanClause whose Occur is SHOULD).
>>> we have discussed some solutions, such as adding bit masks in disjunction
>>> scorers. with this method, when we finds a matched doc, we can recursively
>>> find which leaf scorer is matched. But we think it's not very efficient and
>>> not convenient to use(this is my proposal but not agreed by others in our
>>> team). and then we came up with another one: Modifying DisjunctionSumScorer.
>>> we analysed the codes and found that the only Scorers used by
>>> BooleanScorer2 that will make the children scorers' docID() not equal to
>>> parent is an anonymous class inherited from DisjunctionSumScorer. All other
>>> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous),
>>> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need.
>>> The implementation algorithm of DisjunctionSumScorer use a heap to
>>> find the smallest doc. after finding a matched doc, the currentDoc is the
>>> matched doc and all the scorers in the heap will call nextDoc() so all of
>>> the scorers' current docID the nextDoc of currentDoc. if there are N level
>>> DisjunctionSumScorer, the leaf scorer's current doc is the nth next docId
>>> of the root of the scorer tree.
>>> So we modify the DisjuctionSumScorer and let it behavior as we
>>> expected. And then I wrote some TestCase and it works well. And also I
>>> wrote some random generated TermScorer and compared the nextDoc(),score()
>>> and advance(int) method of original DisjunctionSumScorer and modified one.
>>> nextDoc() and score() and exactly the same. But for advance(int target), we
>>> found some interesting and strange things.
>>> at the beginning, I think if target is less than current docID, it
>>> will just return current docID and do nothing. this assumption let my
>>> algorithm go wrong. Then I read the codes of TermScorer and found each call
>>> of advance(int) of TermScorer will call nextDoc() no matter whether current
>>> docID is larger than target or not.
>>> So I am confused and then read the javadoc of DocIdSetIterator:
>>>  javadoc of DocIdSetIterator.advance(int
>>> target)
>>>
>>> int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws
>>> IOException
>>>
>>> Advances to the first beyond (see NOTE below) the current whose document
>>> number is greater than or equal
>>> to target. Returns the current document number or NO_MORE_DOCS if there
>>> are no more docs in the set.
>>> Behaves as if written:
>>> int advance(int target) {
>>> int doc;
>>> while ((doc = nextDoc()) < target) {
>>> }
>>> return doc;
>>> }
>>> Some implementations are considerably more efficient than that.
>>> NOTE: when target < current implementations may opt not to advance
>>> beyond their current docID().
>>> NOTE: this method may be called with NO_MORE_DOCS for efficiency by some
>>> Scorers. If your
>>> implementation cannot efficiently determine that it should exhaust, it
>>> is recommended that you check for
>>> that value in each call to this method.
>>> NOTE: after the iterator has exhausted you should not call this method,
>>> as it may result in unpredicted
>>> behavior.
>>> 
>>> Then I modified my algorithm again and found that
>>> DisjunctionSumScorer.advance(int target) has some strange behavior. most of
>>> the cases, it will return currentDoc if target < currentDoc. but in some
>>> boundary condition, it will not.
>>> it's not a bug but let me sad. I thought my algorithm has some bug
>>> because it's advance method is not exactly the same as original
>>> DisjunctionSumScorer's.
>>> codes of DisjunctionSumScorer
>>> @Override
>>> public int advance(int target) throws IOException {
>>> if (scorerDocQueue.size() < minimumNrMatchers) {
>>> return currentDoc = NO_MORE_DOCS;
>>> }
>>> if (target <= currentDoc) {
>>> return currentDoc;
>>> }
>>> ....
>>> 
>>> for most case if (target <= currentDoc) it will return currentDoc;
>>> but if previous advance will make sub scorers exhausted, then if may
>>> return NO_MORE_DOCS
>>> an example is:
>>> currentDoc=1
>>> minimumNrMatchers=1
>>> subScorers:
>>> TermScorer: docIds: [1, 2, 6]
>>> TermScorer: docIds: [2, 4]
>>> after first call advance(5)
>>> currentDoc=6
>>> only first scorer is now in the heap, scorerDocQueue.size()==1
>>> then call advance(6)
>>> because scorerDocQueue.size() < minimumNrMatchers, it just return
>>> NO_MORE_DOCS
>>>
>>> My question is why the advance(int target) method is defined like this?
>>> for the reason of efficient or any other reasons?
>>>
>>>
>>
>>
>>
>> 
>> Sincerely yours
>> Mikhail Khludnev
>> gedel@yandex.ru
>>
>> <http://www.griddynamics.com>
>> <mkhludnev@griddynamics.com>
>>
>>
>

Sincerely yours
Mikhail Khludnev
gedel@yandex.ru
<http://www.griddynamics.com>
<mkhludnev@griddynamics.com>
