Return-Path: X-Original-To: apmail-lucene-dev-archive@www.apache.org Delivered-To: apmail-lucene-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id B10DA9B4D for ; Tue, 17 Apr 2012 12:17:20 +0000 (UTC) Received: (qmail 25637 invoked by uid 500); 17 Apr 2012 12:17:19 -0000 Delivered-To: apmail-lucene-dev-archive@lucene.apache.org Received: (qmail 25517 invoked by uid 500); 17 Apr 2012 12:17:19 -0000 Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@lucene.apache.org Delivered-To: mailing list dev@lucene.apache.org Received: (qmail 25509 invoked by uid 99); 17 Apr 2012 12:17:19 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 17 Apr 2012 12:17:19 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of mkhludnev@griddynamics.com designates 209.85.160.48 as permitted sender) Received: from [209.85.160.48] (HELO mail-pb0-f48.google.com) (209.85.160.48) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 17 Apr 2012 12:17:13 +0000 Received: by pbbjt11 with SMTP id jt11so9673153pbb.35 for ; Tue, 17 Apr 2012 05:16:53 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type:x-gm-message-state; bh=gKyKDf+qLGBK2jpE59+loVRisO28J4LDjq47yrphfNg=; b=jU6Qh+thk6c5UNOGuCQfwuPnMHmxyhcxwH+QBU5K+uAr07eNMF0cbxsTHWlNlfNL/u pQzYL97KnaA8JDw7af0PLY4j6CMNYvtt+Y94M8fkZkI5sHMeVOzQe+bsgSWx9ZyLjIxf uodea0D8yZZUqUs/GObl9YMceYg0+eoGQVIkzvQxrAOeDh9uOnjq1jZuX8BHCw8C3GVF pWgi+IIRDCV0ePwhwRRexkF2S43fxFqcxf2YeZC4tuVobiqcG6oCaOC7Wd00lTwM8Nup Vl/jCY1zoTPt5FT8W1AnTvxxtiurOwXw2LO25ZCicQT6C1/nGG1sUpNjEL/tQWMrnpt3 bTHw== Received: by 10.68.241.132 with SMTP id wi4mr3692558pbc.148.1334665013385; Tue, 17 Apr 2012 05:16:53 -0700 (PDT) MIME-Version: 1.0 Received: by 10.68.66.168 with HTTP; Tue, 17 Apr 2012 05:16:12 -0700 (PDT) In-Reply-To: References: From: Mikhail Khludnev Date: Tue, 17 Apr 2012 16:16:12 +0400 Message-ID: Subject: Re: why the of advance(int target) function of DocIdSetIterator is defined with uncertain? To: dev@lucene.apache.org Cc: java-user@lucene.apache.org Content-Type: multipart/alternative; boundary=047d7b339cf7045b3304bddeeaab X-Gm-Message-State: ALoCoQlVb9TATkVk6sHtNXFdV/rlwtxBCzKltSTvJEwY0Uu56kafzCqtZENhnWB9L95c75le4FND X-Virus-Checked: Checked by ClamAV on apache.org --047d7b339cf7045b3304bddeeaab Content-Type: text/plain; charset=ISO-8859-1 Hello, I can't help with the particular question, but can share some experience. My task is roughly the same I've found the patch https://issues.apache.org/jira/browse/LUCENE-2686 is absolutely useful (with one small addition, I'll post it in comments soon). By using it I have disjunction summing query with steady subscorers. Regards On Tue, Apr 17, 2012 at 2:37 PM, Li Li wrote: > hi all, > I am now hacking the BooleanScorer2 to let it keep the docID() of the > leaf scorer(mostly possible TermScorer) the same as the top-level Scorer. > Why I want to do this is: When I Collect a doc, I want to know which term > is matched(especially for BooleanClause whose Occur is SHOULD). we have > discussed some solutions, such as adding bit masks in disjunction scorers. > with this method, when we finds a matched doc, we can recursively find > which leaf scorer is matched. But we think it's not very efficient and not > convenient to use(this is my proposal but not agreed by others in our > team). and then we came up with another one: Modifying DisjunctionSumScorer. > we analysed the codes and found that the only Scorers used by > BooleanScorer2 that will make the children scorers' docID() not equal to > parent is an anonymous class inherited from DisjunctionSumScorer. All other > ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous), > dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need. > The implementation algorithm of DisjunctionSumScorer use a heap to find > the smallest doc. after finding a matched doc, the currentDoc is the > matched doc and all the scorers in the heap will call nextDoc() so all of > the scorers' current docID the nextDoc of currentDoc. if there are N level > DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId > of the root of the scorer tree. > So we modify the DisjuctionSumScorer and let it behavior as we > expected. And then I wrote some TestCase and it works well. And also I > wrote some random generated TermScorer and compared the nextDoc(),score() > and advance(int) method of original DisjunctionSumScorer and modified one. > nextDoc() and score() and exactly the same. But for advance(int target), we > found some interesting and strange things. > at the beginning, I think if target is less than current docID, it will > just return current docID and do nothing. this assumption let my algorithm > go wrong. Then I read the codes of TermScorer and found each call of > advance(int) of TermScorer will call nextDoc() no matter whether current > docID is larger than target or not. > So I am confused and then read the javadoc of DocIdSetIterator: > ----------------- javadoc of DocIdSetIterator.advance(int > target)------------- > > int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws > IOException > > Advances to the first beyond (see NOTE below) the current whose document > number is greater than or equal > to target. Returns the current document number or NO_MORE_DOCS if there > are no more docs in the set. > Behaves as if written: > int advance(int target) { > int doc; > while ((doc = nextDoc()) < target) { > } > return doc; > } > Some implementations are considerably more efficient than that. > NOTE: when target < current implementations may opt not to advance beyond > their current docID(). > NOTE: this method may be called with NO_MORE_DOCS for efficiency by some > Scorers. If your > implementation cannot efficiently determine that it should exhaust, it is > recommended that you check for > that value in each call to this method. > NOTE: after the iterator has exhausted you should not call this method, as > it may result in unpredicted > behavior. > -------------------------------------- > Then I modified my algorithm again and found that > DisjunctionSumScorer.advance(int target) has some strange behavior. most of > the cases, it will return currentDoc if target < currentDoc. but in some > boundary condition, it will not. > it's not a bug but let me sad. I thought my algorithm has some bug because > it's advance method is not exactly the same as original > DisjunctionSumScorer's. > ----codes of DisjunctionSumScorer--- > @Override > public int advance(int target) throws IOException { > if (scorerDocQueue.size() < minimumNrMatchers) { > return currentDoc = NO_MORE_DOCS; > } > if (target <= currentDoc) { > return currentDoc; > } > .... > ------------------- > for most case if (target <= currentDoc) it will return currentDoc; > but if previous advance will make sub scorers exhausted, then if may > return NO_MORE_DOCS > an example is: > currentDoc=-1 > minimumNrMatchers=1 > subScorers: > TermScorer: docIds: [1, 2, 6] > TermScorer: docIds: [2, 4] > after first call advance(5) > currentDoc=6 > only first scorer is now in the heap, scorerDocQueue.size()==1 > then call advance(6) > because scorerDocQueue.size() < minimumNrMatchers, it just return > NO_MORE_DOCS > > My question is why the advance(int target) method is defined like this? > for the reason of efficient or any other reasons? > > -- Sincerely yours Mikhail Khludnev gedel@yandex.ru --047d7b339cf7045b3304bddeeaab Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Hello,

I can't help with the particular question, but can share = some experience. My task is roughly the same I've found the patch https://issues.apa= che.org/jira/browse/LUCENE-2686 is absolutely useful (with one small ad= dition, I'll post it in comments soon). By using it I have disjunction = summing query with steady subscorers.

Regards

On Tue, Apr 17, 2012 at 2:37 = PM, Li Li <fanc= yerii@gmail.com> wrote:
hi all,
=A0=A0=A0 I am now hacking the BooleanScorer2 to let it keep the= docID() of the leaf scorer(mostly possible TermScorer) the same as the top= -level Scorer. Why I want to do this is: When I Collect a doc, I want to kn= ow which term is matched(especially for BooleanClause whose Occur is SHOULD= ). we have discussed some solutions, such as adding bit masks in disjunctio= n scorers. with this method, when we finds a matched doc, we can recursivel= y find which leaf scorer is matched. But we think it's not very efficie= nt and not convenient to use(this is my proposal but not agreed by others i= n our team). and then we came up with another one: Modifying DisjunctionSum= Scorer.
=A0=A0 we analysed the codes and found that the only Scorers used by Boolea= nScorer2 that will make the children scorers' docID() not equal to pare= nt is an anonymous class inherited from DisjunctionSumScorer. All other one= s including SingleMatchScorer, countingConjunctionSumScorer(anonymous), dua= lConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need. =A0=A0 The implementation algorithm of DisjunctionSumScorer use a heap to f= ind the smallest doc. after finding a matched doc, the currentDoc is the ma= tched doc and all the scorers in the heap will call nextDoc() so all of the= scorers' current docID the nextDoc of currentDoc. if there are N level= DisjunctionSumScorer, the leaf scorer's current doc is the n-th next d= ocId of the root of the scorer tree.
=A0=A0 So we modify the DisjuctionSumScorer and let it behavior as we expec= ted. And then I wrote some TestCase and it works well. And also I wrote som= e random generated TermScorer and compared the nextDoc(),score() and advanc= e(int) method of original DisjunctionSumScorer and modified one. nextDoc() = and score() and exactly the same. But for advance(int target), we found som= e interesting and strange things.
=A0=A0 at the beginning, I think if target is less than current docID, it w= ill just return current docID and do nothing. this assumption let my algori= thm go wrong. Then I read the codes of TermScorer and found each call of ad= vance(int) of TermScorer will call nextDoc() no matter whether current docI= D is larger than target or not.
=A0=A0 So I am confused and then read the javadoc of DocIdSetIterator:
-= ---------------- javadoc of DocIdSetIterator.advance(int target)-----------= --

int org.apache.lucene.search.DocIdSetIterator.advance(int target)= throws IOException

Advances to the first beyond (see NOTE below) the current whose documen= t number is greater than or equal
=A0to target. Returns the current doc= ument number or NO_MORE_DOCS if there are no more docs in the set.
Beha= ves as if written:
=A0int advance(int target) {
=A0=A0 int doc;
=A0=A0 while ((doc =3D n= extDoc()) < target) {
=A0=A0 }
=A0=A0 return doc;
=A0}
=A0So= me implementations are considerably more efficient than that.
NOTE: whe= n target < current implementations may opt not to advance beyond their c= urrent docID().
NOTE: this method may be called with NO_MORE_DOCS for efficiency by some Sc= orers. If your
=A0implementation cannot efficiently determine that it s= hould exhaust, it is recommended that you check for
=A0that value in ea= ch call to this method.
NOTE: after the iterator has exhausted you should not call this method, as = it may result in unpredicted
=A0behavior. =A0=A0
------------------= --------------------
Then I modified my algorithm again and found that D= isjunctionSumScorer.advance(int target) has some strange behavior. most of = the cases, it will return currentDoc if target < currentDoc. but in some= boundary condition, it will not.
it's not a bug but let me sad. I thought my algorithm has some bug beca= use it's advance method is not exactly the same as original Disjunction= SumScorer's.
----codes of DisjunctionSumScorer---
=A0 @Override =A0 public int advance(int target) throws IOException {
=A0=A0=A0 if (sc= orerDocQueue.size() < minimumNrMatchers) {
=A0=A0=A0=A0=A0 return cur= rentDoc =3D NO_MORE_DOCS;
=A0=A0=A0 }
=A0=A0=A0 if (target <=3D cu= rrentDoc) {
=A0=A0=A0=A0=A0 return currentDoc;
=A0=A0=A0 }
=A0=A0 ....
-------------------
for most case if (targ= et <=3D currentDoc) it will return currentDoc;
but if previous advanc= e will make sub scorers exhausted, then if may return NO_MORE_DOCS
an ex= ample is:
=A0=A0 currentDoc=3D-1
=A0=A0 minimumNrMatchers=3D1
=A0=A0 subScorers= :
=A0=A0=A0=A0=A0 TermScorer: docIds: [1, 2, 6]
=A0=A0=A0=A0=A0 Term= Scorer: docIds: [2, 4]
after first call advance(5)
=A0=A0=A0 currentD= oc=3D6
=A0=A0=A0 only first scorer is now in the heap, scorerDocQueue.si= ze()=3D=3D1
then call advance(6)
=A0=A0=A0 because scorerDocQueue.size() < minimu= mNrMatchers, it just return NO_MORE_DOCS

My question is why the adva= nce(int target) method is defined like this? for the reason of efficient or= any other reasons?
=A0=A0=A0



--
Sincerely yours
Mikh= ail Khludnev

--047d7b339cf7045b3304bddeeaab--