On Tue, Apr 17, 2012 at 8:16 PM, Mikhail Khludnev <mkhludnev@griddynamics.com> wrote:

Hello,

I can't help with the particular question, but I can share some experience. My task is roughly the same. I've found the patch https://issues.apache.org/jira/browse/LUCENE-2686 absolutely useful (with one small addition, which I'll post in the comments soon). By using it I get a disjunction summing query with steady subscorers.

Regards

On Tue, Apr 17, 2012 at 2:37 PM, Li Li <fancyerii@gmail.com> wrote:

hi all,
    I am now hacking BooleanScorer2 so that the docID() of each leaf scorer (most likely a TermScorer) stays the same as that of the top-level Scorer. The reason is this: when I collect a doc, I want to know which terms matched (especially for a BooleanClause whose Occur is SHOULD). We have discussed some solutions, such as adding bit masks to the disjunction scorers. With that method, when we find a matched doc, we can recursively find which leaf scorers matched. But we think it is not very efficient and not convenient to use (this was my proposal, but others in our team did not agree). Then we came up with another one: modifying DisjunctionSumScorer.
   We analysed the code and found that the only Scorer used by BooleanScorer2 that makes the child scorers' docID() differ from the parent's is an anonymous class inherited from DisjunctionSumScorer. All the others, including SingleMatchScorer, countingConjunctionSumScorer (anonymous), dualConjunctionSumScorer, ReqOptSumScorer and ReqExclScorer, fit our need.
   The DisjunctionSumScorer algorithm uses a heap to find the smallest doc. After it finds a matched doc, currentDoc is the matched doc and the scorers in the heap positioned on it have called nextDoc(), so every sub-scorer's current docID is already the next doc after currentDoc. If there are N levels of DisjunctionSumScorer, a leaf scorer's current doc is the N-th next docID relative to the root of the scorer tree.
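To illustrate the heap behaviour described above, here is a minimal sketch. It is not the Lucene implementation: HeapDisjunctionSketch and Leaf are made-up names, Leaf is a stand-in for a TermScorer over a sorted array of doc IDs, and scoring and minimumNrMatchers are left out. It only shows that once the parent reports a doc, the children that matched it have already moved on.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical model of a heap-based disjunction; not the Lucene class. */
class HeapDisjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical leaf iterator over a sorted array of doc IDs. */
    static final class Leaf {
        final int[] docs;
        int pos = 0; // already positioned on its first doc
        Leaf(int... docs) { this.docs = docs; }
        int docID() { return pos < docs.length ? docs[pos] : NO_MORE_DOCS; }
        int nextDoc() { if (pos < docs.length) pos++; return docID(); }
    }

    // Heap ordered by each leaf's current doc ID.
    final PriorityQueue<Leaf> heap =
            new PriorityQueue<>(Comparator.comparingInt(Leaf::docID));
    int currentDoc = -1;

    HeapDisjunctionSketch(Leaf... leaves) {
        for (Leaf leaf : leaves) {
            if (leaf.docID() != NO_MORE_DOCS) heap.add(leaf);
        }
    }

    int nextDoc() {
        if (heap.isEmpty()) return currentDoc = NO_MORE_DOCS;
        currentDoc = heap.peek().docID();
        // Every leaf sitting on currentDoc is moved to its next doc,
        // so after this loop no leaf is positioned on currentDoc any more.
        while (!heap.isEmpty() && heap.peek().docID() == currentDoc) {
            Leaf top = heap.poll();
            if (top.nextDoc() != NO_MORE_DOCS) heap.add(top);
        }
        return currentDoc;
    }

    public static void main(String[] args) {
        Leaf a = new Leaf(1, 2, 6);
        Leaf b = new Leaf(2, 4);
        HeapDisjunctionSketch parent = new HeapDisjunctionSketch(a, b);
        int doc;
        while ((doc = parent.nextDoc()) != NO_MORE_DOCS) {
            System.out.println("parent=" + doc + " a=" + a.docID() + " b=" + b.docID());
        }
    }
}
```

In this sketch, when the parent returns doc 2 (matched by both leaves), the leaves are already on 6 and 4, which is why a collector cannot simply compare a leaf's docID() with the parent's.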
   So we modified DisjunctionSumScorer to behave as we expected. I then wrote some test cases and it works well. I also wrote some randomly generated TermScorers and compared the nextDoc(), score() and advance(int) methods of the original DisjunctionSumScorer and the modified one. nextDoc() and score() are exactly the same. But for advance(int target), we found some interesting and strange things.
   At first, I thought that if target is less than the current docID, advance would just return the current docID and do nothing. This assumption made my algorithm go wrong. Then I read the code of TermScorer and found that each call of TermScorer.advance(int) calls nextDoc(), no matter whether the current docID is larger than target or not.
   So I was confused and then read the javadoc of DocIdSetIterator:
----------------- javadoc of DocIdSetIterator.advance(int target) -------------

int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws IOException

Advances to the first beyond (see NOTE below) the current whose document number is greater than or equal to target. Returns the current document number or NO_MORE_DOCS if there are no more docs in the set.
Behaves as if written:
 int advance(int target) {
   int doc;
   while ((doc = nextDoc()) < target) {
   }
   return doc;
 }
Some implementations are considerably more efficient than that.
NOTE: when target < current implementations may opt not to advance beyond their current docID().
NOTE: this method may be called with NO_MORE_DOCS for efficiency by some Scorers. If your implementation cannot efficiently determine that it should exhaust, it is recommended that you check for that value in each call to this method.
NOTE: after the iterator has exhausted you should not call this method, as it may result in unpredicted behavior.
--------------------------------------
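As a small illustration of the "behaves as if written" loop from the javadoc, here is a sketch over a plain sorted array. ReferenceAdvanceSketch is a made-up class, not part of Lucene. It shows that the reference loop always moves at least one doc forward, even when target is not greater than the current docID, which an implementation is allowed (but not required) to avoid.

```java
/** Hypothetical iterator that implements advance() exactly like the javadoc loop. */
class ReferenceAdvanceSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs; // sorted doc IDs
    private int pos = -1;     // -1 means "not positioned yet"

    ReferenceAdvanceSketch(int... docs) { this.docs = docs; }

    int docID() {
        if (pos < 0) return -1;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    int nextDoc() {
        if (pos < docs.length) pos++;
        return docID();
    }

    /** The reference loop from the javadoc: nextDoc() is called at least once. */
    int advance(int target) {
        int doc;
        while ((doc = nextDoc()) < target) {
        }
        return doc;
    }

    public static void main(String[] args) {
        ReferenceAdvanceSketch it = new ReferenceAdvanceSketch(1, 2, 6);
        System.out.println(it.advance(2)); // lands on doc 2
        System.out.println(it.advance(2)); // target == current, but the loop still moves to 6
    }
}
```

So a caller that repeats advance(target) with target <= docID() cannot rely on staying put: depending on the implementation, the iterator may stay on the current doc or move forward.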
Then I modified my algorithm again and found that DisjunctionSumScorer.advance(int target) has some strange behavior. In most cases it returns currentDoc if target <= currentDoc, but in some boundary conditions it does not.
It's not a bug, but it made me sad: I thought my algorithm had a bug because its advance method did not behave exactly the same as the original DisjunctionSumScorer's.
---- code of DisjunctionSumScorer ----
  @Override
  public int advance(int target) throws IOException {
    if (scorerDocQueue.size() < minimumNrMatchers) {
      return currentDoc = NO_MORE_DOCS;
    }
    if (target <= currentDoc) {
      return currentDoc;
    }
    ....
-------------------
In most cases, if (target <= currentDoc) it returns currentDoc;
but if the previous advance exhausted the sub-scorers, then it may return NO_MORE_DOCS.
An example is:
   currentDoc=-1
   minimumNrMatchers=1
   subScorers:
      TermScorer: docIds: [1, 2, 6]
      TermScorer: docIds: [2, 4]
after the first call, advance(5):
    currentDoc=6
    no scorer is left in the heap now: the second one was exhausted while skipping to 5, and the first one matched 6 and was then moved past it, so scorerDocQueue.size()==0
then call advance(6):
    because scorerDocQueue.size() < minimumNrMatchers, it just returns NO_MORE_DOCS, even though target <= currentDoc
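The example can be replayed with a simplified model. This is only a sketch under assumptions, not the Lucene source: DisjunctionAdvanceSketch, Leaf and queueSize() are made-up names, the first two checks in advance() follow the code quoted above, and the elided rest of the method is filled in with a plain heap loop of my own. It assumes minimumNrMatchers >= 1.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Simplified model of the quoted advance() logic; not the Lucene class itself. */
class DisjunctionAdvanceSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for a TermScorer over a sorted array of doc IDs. */
    static final class Leaf {
        private final int[] docs;
        private int pos = -1;
        Leaf(int... docs) { this.docs = docs; }
        int docID() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
        int nextDoc() { if (pos < docs.length) pos++; return docID(); }
        int advance(int target) {
            int doc;
            while ((doc = nextDoc()) < target) { }
            return doc;
        }
    }

    private final PriorityQueue<Leaf> queue =
            new PriorityQueue<>(Comparator.comparingInt(Leaf::docID));
    private final int minimumNrMatchers; // assumed >= 1
    private int currentDoc = -1;

    DisjunctionAdvanceSketch(int minimumNrMatchers, Leaf... leaves) {
        this.minimumNrMatchers = minimumNrMatchers;
        for (Leaf leaf : leaves) {
            if (leaf.nextDoc() != NO_MORE_DOCS) queue.add(leaf);
        }
    }

    int queueSize() { return queue.size(); }

    /** Collects the matchers of the smallest doc and moves each of them past it. */
    private boolean advanceAfterCurrent() {
        do {
            currentDoc = queue.peek().docID();
            int nrMatchers = 0;
            while (!queue.isEmpty() && queue.peek().docID() == currentDoc) {
                Leaf top = queue.poll();
                nrMatchers++;
                if (top.nextDoc() != NO_MORE_DOCS) queue.add(top); // exhausted leaves are dropped
            }
            if (nrMatchers >= minimumNrMatchers) return true;
        } while (queue.size() >= minimumNrMatchers);
        return false;
    }

    int advance(int target) {
        // Same order of checks as in the quoted code: queue size first, then target.
        if (queue.size() < minimumNrMatchers) return currentDoc = NO_MORE_DOCS;
        if (target <= currentDoc) return currentDoc;
        while (true) {
            if (queue.peek().docID() >= target) {
                return advanceAfterCurrent() ? currentDoc : (currentDoc = NO_MORE_DOCS);
            }
            Leaf top = queue.poll();
            if (top.advance(target) != NO_MORE_DOCS) queue.add(top);
            if (queue.size() < minimumNrMatchers) return currentDoc = NO_MORE_DOCS;
        }
    }

    public static void main(String[] args) {
        DisjunctionAdvanceSketch d =
                new DisjunctionAdvanceSketch(1, new Leaf(1, 2, 6), new Leaf(2, 4));
        System.out.println(d.advance(5));                 // matches doc 6
        System.out.println(d.queueSize());                // both leaves are exhausted by now
        System.out.println(d.advance(6) == NO_MORE_DOCS); // the size check fires first
    }
}
```

In this model the second advance(6) never reaches the target <= currentDoc test, because the queue size check comes first and the queue is already empty.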

My question is: why is the advance(int target) method defined like this? Is it for efficiency, or for some other reason?

--
Sincerely yours
Mikhail Khludnev
gedel@yandex.ru