I got it.
I found my test codes some problem.
the new test result is:
when minShould=1, new algorithm is a little bit faster than old one. when
minShould>1, old algorithm is faster.
On Sun, Apr 22, 2012 at 3:32 AM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:
> Li,
>
> I can give you my understanding of difference in DisjunctionSumScorer and
> https://issues.apache.org/jira/browse/LUCENE2686
>
> When you choose steady children approach, even you omit call score() you
> have to enumerate top of child scorers heap twice.
> Check nextDoc() from the patch: first loop pushes top of heap first, then
> the second loop through the top is done by
> countMatches(1); countMatches(2);
>
> Current DisjunctionSumScorer enumerate top of heap once per every doc:
> while it loops through top of heap it counts number of clauses matched;
> accumulate score; and pushes top children one step forward.
>
> what's minimumNrMatchers do you have? can you upload your test github?
> (mail list rips attachments off)
>
> On Thu, Apr 19, 2012 at 7:34 AM, Li Li <fancyerii@gmail.com> wrote:
>
>> Michael McCandless wrote:
>>
>> So... the good news is I made a new scorer (basically copied
>> DisjunctionMaxScorer and then tweaked from there) that scores the ORonly
>> case. All tests pass w/ this new scorer.
>>
>> And more good news is that if you don't score (I sort by doctitle to do
>> that), you get a speedup – 7.7% in my simplistic test (prefix query unit*,
>> expands to 988 terms, but I force it to do a scoring BQ rewrite, plus force
>> it to use BS2 not BS – the nocommits in the patch).
>>
>> But the bad news is with scoring on it's 22.7% slower!
>>
>> And, the weird news is, I discovered accidentally that BS2 is much (> 2X)
>> faster for this one query. I think we need to modify the criteria that
>> decides whether to use BS or BS2... maybe when there are lots of
>> lowishdocFreq terms, BS2 is better?
>> 
>> why new algorithm is 22% slower than old one?
>> I read the codes and feel them almost the same.
>>
>> On Tue, Apr 17, 2012 at 8:16 PM, Mikhail Khludnev <
>> mkhludnev@griddynamics.com> wrote:
>>
>>> Hello,
>>>
>>> I can't help with the particular question, but can share some
>>> experience. My task is roughly the same I've found the patch
>>> https://issues.apache.org/jira/browse/LUCENE2686 is absolutely useful
>>> (with one small addition, I'll post it in comments soon). By using it I
>>> have disjunction summing query with steady subscorers.
>>>
>>> Regards
>>>
>>> On Tue, Apr 17, 2012 at 2:37 PM, Li Li <fancyerii@gmail.com> wrote:
>>>
>>>> hi all,
>>>> I am now hacking the BooleanScorer2 to let it keep the docID() of
>>>> the leaf scorer(mostly possible TermScorer) the same as the toplevel
>>>> Scorer. Why I want to do this is: When I Collect a doc, I want to know
>>>> which term is matched(especially for BooleanClause whose Occur is SHOULD).
>>>> we have discussed some solutions, such as adding bit masks in disjunction
>>>> scorers. with this method, when we finds a matched doc, we can recursively
>>>> find which leaf scorer is matched. But we think it's not very efficient and
>>>> not convenient to use(this is my proposal but not agreed by others in our
>>>> team). and then we came up with another one: Modifying DisjunctionSumScorer.
>>>> we analysed the codes and found that the only Scorers used by
>>>> BooleanScorer2 that will make the children scorers' docID() not equal to
>>>> parent is an anonymous class inherited from DisjunctionSumScorer. All other
>>>> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous),
>>>> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need.
>>>> The implementation algorithm of DisjunctionSumScorer use a heap to
>>>> find the smallest doc. after finding a matched doc, the currentDoc is the
>>>> matched doc and all the scorers in the heap will call nextDoc() so all of
>>>> the scorers' current docID the nextDoc of currentDoc. if there are N level
>>>> DisjunctionSumScorer, the leaf scorer's current doc is the nth next docId
>>>> of the root of the scorer tree.
>>>> So we modify the DisjuctionSumScorer and let it behavior as we
>>>> expected. And then I wrote some TestCase and it works well. And also I
>>>> wrote some random generated TermScorer and compared the nextDoc(),score()
>>>> and advance(int) method of original DisjunctionSumScorer and modified one.
>>>> nextDoc() and score() and exactly the same. But for advance(int target),
we
>>>> found some interesting and strange things.
>>>> at the beginning, I think if target is less than current docID, it
>>>> will just return current docID and do nothing. this assumption let my
>>>> algorithm go wrong. Then I read the codes of TermScorer and found each call
>>>> of advance(int) of TermScorer will call nextDoc() no matter whether current
>>>> docID is larger than target or not.
>>>> So I am confused and then read the javadoc of DocIdSetIterator:
>>>>  javadoc of DocIdSetIterator.advance(int
>>>> target)
>>>>
>>>> int org.apache.lucene.search.DocIdSetIterator.advance(int target)
>>>> throws IOException
>>>>
>>>> Advances to the first beyond (see NOTE below) the current whose
>>>> document number is greater than or equal
>>>> to target. Returns the current document number or NO_MORE_DOCS if
>>>> there are no more docs in the set.
>>>> Behaves as if written:
>>>> int advance(int target) {
>>>> int doc;
>>>> while ((doc = nextDoc()) < target) {
>>>> }
>>>> return doc;
>>>> }
>>>> Some implementations are considerably more efficient than that.
>>>> NOTE: when target < current implementations may opt not to advance
>>>> beyond their current docID().
>>>> NOTE: this method may be called with NO_MORE_DOCS for efficiency by
>>>> some Scorers. If your
>>>> implementation cannot efficiently determine that it should exhaust, it
>>>> is recommended that you check for
>>>> that value in each call to this method.
>>>> NOTE: after the iterator has exhausted you should not call this method,
>>>> as it may result in unpredicted
>>>> behavior.
>>>> 
>>>> Then I modified my algorithm again and found that
>>>> DisjunctionSumScorer.advance(int target) has some strange behavior. most
of
>>>> the cases, it will return currentDoc if target < currentDoc. but in some
>>>> boundary condition, it will not.
>>>> it's not a bug but let me sad. I thought my algorithm has some bug
>>>> because it's advance method is not exactly the same as original
>>>> DisjunctionSumScorer's.
>>>> codes of DisjunctionSumScorer
>>>> @Override
>>>> public int advance(int target) throws IOException {
>>>> if (scorerDocQueue.size() < minimumNrMatchers) {
>>>> return currentDoc = NO_MORE_DOCS;
>>>> }
>>>> if (target <= currentDoc) {
>>>> return currentDoc;
>>>> }
>>>> ....
>>>> 
>>>> for most case if (target <= currentDoc) it will return currentDoc;
>>>> but if previous advance will make sub scorers exhausted, then if may
>>>> return NO_MORE_DOCS
>>>> an example is:
>>>> currentDoc=1
>>>> minimumNrMatchers=1
>>>> subScorers:
>>>> TermScorer: docIds: [1, 2, 6]
>>>> TermScorer: docIds: [2, 4]
>>>> after first call advance(5)
>>>> currentDoc=6
>>>> only first scorer is now in the heap, scorerDocQueue.size()==1
>>>> then call advance(6)
>>>> because scorerDocQueue.size() < minimumNrMatchers, it just return
>>>> NO_MORE_DOCS
>>>>
>>>> My question is why the advance(int target) method is defined like this?
>>>> for the reason of efficient or any other reasons?
>>>>
>>>>
>>>
>>>
>>>
>>> 
>>> Sincerely yours
>>> Mikhail Khludnev
>>> gedel@yandex.ru
>>>
>>> <http://www.griddynamics.com>
>>> <mkhludnev@griddynamics.com>
>>>
>>>
>>
>
>
> 
> Sincerely yours
> Mikhail Khludnev
> gedel@yandex.ru
>
> <http://www.griddynamics.com>
> <mkhludnev@griddynamics.com>
>
>
