Return-Path: X-Original-To: apmail-lucene-dev-archive@www.apache.org Delivered-To: apmail-lucene-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 0C75E9854 for ; Wed, 18 Apr 2012 04:19:27 +0000 (UTC) Received: (qmail 13989 invoked by uid 500); 18 Apr 2012 04:19:25 -0000 Delivered-To: apmail-lucene-dev-archive@lucene.apache.org Received: (qmail 13930 invoked by uid 500); 18 Apr 2012 04:19:25 -0000 Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@lucene.apache.org Delivered-To: mailing list dev@lucene.apache.org Received: (qmail 13903 invoked by uid 99); 18 Apr 2012 04:19:24 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 18 Apr 2012 04:19:24 +0000 X-ASF-Spam-Status: No, hits=2.5 required=5.0 tests=FREEMAIL_REPLY,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of fancyerii@gmail.com designates 209.85.220.176 as permitted sender) Received: from [209.85.220.176] (HELO mail-vx0-f176.google.com) (209.85.220.176) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 18 Apr 2012 04:19:17 +0000 Received: by vcbfl17 with SMTP id fl17so7288871vcb.35 for ; Tue, 17 Apr 2012 21:18:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=gnmHyyw/FcERLDVT941LnXmhVNRFHhQFpyEEr93xVRE=; b=sZGPRUZkd4zvbZSKLHg9XF1YjedgItKQkdmeCjf5v8++KQNjm81TRk+rMxKy7RqRUg T9so8j+D62ASQPmR+df4DKes3VcODNG22G0da0PEF8kHS2QSYowjtUi+H+9EtXofd4l4 QfQ4hXkP59sQECorXtn3D7VAlR9kiYksJPDwngVAzM1Xuo/R6uLY1ZgOusi2DJUuGyEZ XhNN+XOuc4DYOwnDhgOFP1bTlP74Vqzxjmo7yLsajnH9k6AHqmNJ2+cjii8sodublbuV No1uPzUN8YUylvliR/4DSVfCVs6do2Pk+aYhYqh36Y8BZ30nd4MIuD1hdal1/bH4WU2v WtEA== MIME-Version: 1.0 Received: by 10.52.89.106 with SMTP id bn10mr294709vdb.116.1334722736178; Tue, 17 Apr 2012 21:18:56 -0700 (PDT) Received: by 10.220.66.71 with HTTP; Tue, 17 Apr 2012 21:18:56 -0700 (PDT) In-Reply-To: References: Date: Wed, 18 Apr 2012 12:18:56 +0800 Message-ID: Subject: Re: why the of advance(int target) function of DocIdSetIterator is defined with uncertain? From: Li Li To: dev@lucene.apache.org Cc: java-user@lucene.apache.org Content-Type: multipart/alternative; boundary=20cf307f3a5690464604bdec5a1b --20cf307f3a5690464604bdec5a1b Content-Type: text/plain; charset=ISO-8859-1 thanks. it's great. On Tue, Apr 17, 2012 at 8:16 PM, Mikhail Khludnev < mkhludnev@griddynamics.com> wrote: > Hello, > > I can't help with the particular question, but can share some experience. > My task is roughly the same I've found the patch > https://issues.apache.org/jira/browse/LUCENE-2686 is absolutely useful > (with one small addition, I'll post it in comments soon). By using it I > have disjunction summing query with steady subscorers. > > Regards > > On Tue, Apr 17, 2012 at 2:37 PM, Li Li wrote: > >> hi all, >> I am now hacking the BooleanScorer2 to let it keep the docID() of the >> leaf scorer(mostly possible TermScorer) the same as the top-level Scorer. >> Why I want to do this is: When I Collect a doc, I want to know which term >> is matched(especially for BooleanClause whose Occur is SHOULD). we have >> discussed some solutions, such as adding bit masks in disjunction scorers. >> with this method, when we finds a matched doc, we can recursively find >> which leaf scorer is matched. But we think it's not very efficient and not >> convenient to use(this is my proposal but not agreed by others in our >> team). and then we came up with another one: Modifying DisjunctionSumScorer. >> we analysed the codes and found that the only Scorers used by >> BooleanScorer2 that will make the children scorers' docID() not equal to >> parent is an anonymous class inherited from DisjunctionSumScorer. All other >> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous), >> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need. >> The implementation algorithm of DisjunctionSumScorer use a heap to >> find the smallest doc. after finding a matched doc, the currentDoc is the >> matched doc and all the scorers in the heap will call nextDoc() so all of >> the scorers' current docID the nextDoc of currentDoc. if there are N level >> DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId >> of the root of the scorer tree. >> So we modify the DisjuctionSumScorer and let it behavior as we >> expected. And then I wrote some TestCase and it works well. And also I >> wrote some random generated TermScorer and compared the nextDoc(),score() >> and advance(int) method of original DisjunctionSumScorer and modified one. >> nextDoc() and score() and exactly the same. But for advance(int target), we >> found some interesting and strange things. >> at the beginning, I think if target is less than current docID, it >> will just return current docID and do nothing. this assumption let my >> algorithm go wrong. Then I read the codes of TermScorer and found each call >> of advance(int) of TermScorer will call nextDoc() no matter whether current >> docID is larger than target or not. >> So I am confused and then read the javadoc of DocIdSetIterator: >> ----------------- javadoc of DocIdSetIterator.advance(int >> target)------------- >> >> int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws >> IOException >> >> Advances to the first beyond (see NOTE below) the current whose document >> number is greater than or equal >> to target. Returns the current document number or NO_MORE_DOCS if there >> are no more docs in the set. >> Behaves as if written: >> int advance(int target) { >> int doc; >> while ((doc = nextDoc()) < target) { >> } >> return doc; >> } >> Some implementations are considerably more efficient than that. >> NOTE: when target < current implementations may opt not to advance beyond >> their current docID(). >> NOTE: this method may be called with NO_MORE_DOCS for efficiency by some >> Scorers. If your >> implementation cannot efficiently determine that it should exhaust, it >> is recommended that you check for >> that value in each call to this method. >> NOTE: after the iterator has exhausted you should not call this method, >> as it may result in unpredicted >> behavior. >> -------------------------------------- >> Then I modified my algorithm again and found that >> DisjunctionSumScorer.advance(int target) has some strange behavior. most of >> the cases, it will return currentDoc if target < currentDoc. but in some >> boundary condition, it will not. >> it's not a bug but let me sad. I thought my algorithm has some bug >> because it's advance method is not exactly the same as original >> DisjunctionSumScorer's. >> ----codes of DisjunctionSumScorer--- >> @Override >> public int advance(int target) throws IOException { >> if (scorerDocQueue.size() < minimumNrMatchers) { >> return currentDoc = NO_MORE_DOCS; >> } >> if (target <= currentDoc) { >> return currentDoc; >> } >> .... >> ------------------- >> for most case if (target <= currentDoc) it will return currentDoc; >> but if previous advance will make sub scorers exhausted, then if may >> return NO_MORE_DOCS >> an example is: >> currentDoc=-1 >> minimumNrMatchers=1 >> subScorers: >> TermScorer: docIds: [1, 2, 6] >> TermScorer: docIds: [2, 4] >> after first call advance(5) >> currentDoc=6 >> only first scorer is now in the heap, scorerDocQueue.size()==1 >> then call advance(6) >> because scorerDocQueue.size() < minimumNrMatchers, it just return >> NO_MORE_DOCS >> >> My question is why the advance(int target) method is defined like this? >> for the reason of efficient or any other reasons? >> >> > > > > -- > Sincerely yours > Mikhail Khludnev > gedel@yandex.ru > > > > > --20cf307f3a5690464604bdec5a1b Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable thanks. it's great.

On Tue, Apr 17, 2= 012 at 8:16 PM, Mikhail Khludnev <mkhludnev@griddynamics.com> wrote:<= br>
Hello,

I can't help with the part= icular question, but can share some experience. My task is roughly the same= I've found the patch https://issues.apache.org/jira/browse/LUCENE= -2686 is absolutely useful (with one small addition, I'll post it i= n comments soon). By using it I have disjunction summing query with steady = subscorers.

Regards

On Tue, Apr= 17, 2012 at 2:37 PM, Li Li <fancyerii@gmail.com> wrote:
hi all,
=A0=A0=A0 I am now hacking the BooleanScorer2 to let it keep the= docID() of the leaf scorer(mostly possible TermScorer) the same as the top= -level Scorer. Why I want to do this is: When I Collect a doc, I want to kn= ow which term is matched(especially for BooleanClause whose Occur is SHOULD= ). we have discussed some solutions, such as adding bit masks in disjunctio= n scorers. with this method, when we finds a matched doc, we can recursivel= y find which leaf scorer is matched. But we think it's not very efficie= nt and not convenient to use(this is my proposal but not agreed by others i= n our team). and then we came up with another one: Modifying DisjunctionSum= Scorer.
=A0=A0 we analysed the codes and found that the only Scorers used by Boolea= nScorer2 that will make the children scorers' docID() not equal to pare= nt is an anonymous class inherited from DisjunctionSumScorer. All other one= s including SingleMatchScorer, countingConjunctionSumScorer(anonymous), dua= lConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need. =A0=A0 The implementation algorithm of DisjunctionSumScorer use a heap to f= ind the smallest doc. after finding a matched doc, the currentDoc is the ma= tched doc and all the scorers in the heap will call nextDoc() so all of the= scorers' current docID the nextDoc of currentDoc. if there are N level= DisjunctionSumScorer, the leaf scorer's current doc is the n-th next d= ocId of the root of the scorer tree.
=A0=A0 So we modify the DisjuctionSumScorer and let it behavior as we expec= ted. And then I wrote some TestCase and it works well. And also I wrote som= e random generated TermScorer and compared the nextDoc(),score() and advanc= e(int) method of original DisjunctionSumScorer and modified one. nextDoc() = and score() and exactly the same. But for advance(int target), we found som= e interesting and strange things.
=A0=A0 at the beginning, I think if target is less than current docID, it w= ill just return current docID and do nothing. this assumption let my algori= thm go wrong. Then I read the codes of TermScorer and found each call of ad= vance(int) of TermScorer will call nextDoc() no matter whether current docI= D is larger than target or not.
=A0=A0 So I am confused and then read the javadoc of DocIdSetIterator:
-= ---------------- javadoc of DocIdSetIterator.advance(int target)-----------= --

int org.apache.lucene.search.DocIdSetIterator.advance(int target)= throws IOException

Advances to the first beyond (see NOTE below) the current whose documen= t number is greater than or equal
=A0to target. Returns the current doc= ument number or NO_MORE_DOCS if there are no more docs in the set.
Beha= ves as if written:
=A0int advance(int target) {
=A0=A0 int doc;
=A0=A0 while ((doc =3D n= extDoc()) < target) {
=A0=A0 }
=A0=A0 return doc;
=A0}
=A0So= me implementations are considerably more efficient than that.
NOTE: whe= n target < current implementations may opt not to advance beyond their c= urrent docID().
NOTE: this method may be called with NO_MORE_DOCS for efficiency by some Sc= orers. If your
=A0implementation cannot efficiently determine that it s= hould exhaust, it is recommended that you check for
=A0that value in ea= ch call to this method.
NOTE: after the iterator has exhausted you should not call this method, as = it may result in unpredicted
=A0behavior. =A0=A0
------------------= --------------------
Then I modified my algorithm again and found that D= isjunctionSumScorer.advance(int target) has some strange behavior. most of = the cases, it will return currentDoc if target < currentDoc. but in some= boundary condition, it will not.
it's not a bug but let me sad. I thought my algorithm has some bug beca= use it's advance method is not exactly the same as original Disjunction= SumScorer's.
----codes of DisjunctionSumScorer---
=A0 @Override =A0 public int advance(int target) throws IOException {
=A0=A0=A0 if (sc= orerDocQueue.size() < minimumNrMatchers) {
=A0=A0=A0=A0=A0 return cur= rentDoc =3D NO_MORE_DOCS;
=A0=A0=A0 }
=A0=A0=A0 if (target <=3D cu= rrentDoc) {
=A0=A0=A0=A0=A0 return currentDoc;
=A0=A0=A0 }
=A0=A0 ....
-------------------
for most case if (targ= et <=3D currentDoc) it will return currentDoc;
but if previous advanc= e will make sub scorers exhausted, then if may return NO_MORE_DOCS
an ex= ample is:
=A0=A0 currentDoc=3D-1
=A0=A0 minimumNrMatchers=3D1
=A0=A0 subScorers= :
=A0=A0=A0=A0=A0 TermScorer: docIds: [1, 2, 6]
=A0=A0=A0=A0=A0 Term= Scorer: docIds: [2, 4]
after first call advance(5)
=A0=A0=A0 currentD= oc=3D6
=A0=A0=A0 only first scorer is now in the heap, scorerDocQueue.si= ze()=3D=3D1
then call advance(6)
=A0=A0=A0 because scorerDocQueue.size() < minimu= mNrMatchers, it just return NO_MORE_DOCS

My question is why the adva= nce(int target) method is defined like this? for the reason of efficient or= any other reasons?
=A0=A0=A0



--
Sincerely yours
Mikhail Khludnev
=

--20cf307f3a5690464604bdec5a1b--