Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of mkhludnev@griddynamics.com
 designates 209.85.160.48 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAFAd71Vkq0MRe+5QSxphPpW9hbS52A_Mta=JX33G-PahAYTdyA@mail.gmail.com>
References: 
 <CAFAd71Vkq0MRe+5QSxphPpW9hbS52A_Mta=JX33G-PahAYTdyA@mail.gmail.com>
From: Mikhail Khludnev <mkhludnev@griddynamics.com>
Date: Tue, 17 Apr 2012 16:16:12 +0400
Message-ID: 
 <CANGii8eBkLKc3PruVs+rEHYR3LyXjVJ0WVB0z3qSr_FFoVXnng@mail.gmail.com>
Subject: Re: why the of advance(int target) function of DocIdSetIterator is
 defined with uncertain?
To: dev@lucene.apache.org
Cc: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=047d7b339cf7045b3304bddeeaab

--047d7b339cf7045b3304bddeeaab
Content-Type: text/plain; charset=ISO-8859-1

Hello,

I can't help with this particular question, but I can share some experience.
My task is roughly the same, and I've found the patch at
https://issues.apache.org/jira/browse/LUCENE-2686 very useful (with one
small addition, which I'll post in the comments soon). With it, I have a
disjunction sum query with steady subscorers.

Regards

On Tue, Apr 17, 2012 at 2:37 PM, Li Li <fancyerii@gmail.com> wrote:

> Hi all,
>     I am hacking BooleanScorer2 so that the docID() of each leaf scorer
> (most likely a TermScorer) stays the same as that of the top-level Scorer.
> The reason is that when I collect a doc, I want to know which terms
> matched (especially for a BooleanClause whose Occur is SHOULD). We have
> discussed some solutions, such as adding bit masks to the disjunction
> scorers. With that method, when we find a matching doc, we can recursively
> find which leaf scorers matched. But we think it is neither very efficient
> nor convenient to use (this was my proposal, but others on our team did
> not agree). We then came up with another one: modifying
> DisjunctionSumScorer.
>    We analysed the code and found that the only Scorer used by
> BooleanScorer2 that leaves its child scorers' docID() different from the
> parent's is an anonymous class inherited from DisjunctionSumScorer. All
> the others, including SingleMatchScorer, countingConjunctionSumScorer
> (anonymous), dualConjunctionSumScorer, ReqOptSumScorer and ReqExclScorer,
> fit our needs.
>    DisjunctionSumScorer uses a heap to find the smallest doc. After it
> finds a matching doc, currentDoc is that doc, and every scorer in the
> heap has been moved past it with nextDoc(), so each scorer's current
> docID is its next doc after currentDoc. With N levels of
> DisjunctionSumScorer, a leaf scorer's current doc is the N-th next docID
> after that of the root of the scorer tree.
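
To illustrate the behaviour described above, here is a toy model of a
heap-based disjunction. It is a minimal sketch and not Lucene code: the
names EagerDisjunctionSketch and Sub are hypothetical, and the
sub-iterators simply walk sorted int arrays. It only shows why a
sub-iterator's docID() runs ahead of the parent's once a match is reported.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Toy model only (assumption: sub-iterators are sorted int arrays). */
class EagerDisjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for a leaf scorer such as TermScorer. */
    static final class Sub {
        private final int[] docs;
        private int pos = -1;

        Sub(int... docs) { this.docs = docs; }

        int docID() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        int nextDoc() {
            if (pos < docs.length) pos++;
            return docID();
        }
    }

    // Min-heap ordered by each sub-iterator's current doc.
    private final PriorityQueue<Sub> heap =
            new PriorityQueue<Sub>(Comparator.comparingInt(Sub::docID));
    private int currentDoc = -1;

    EagerDisjunctionSketch(Sub... subs) {
        for (Sub s : subs) {
            if (s.nextDoc() != NO_MORE_DOCS) heap.add(s);
        }
    }

    int docID() { return currentDoc; }

    int nextDoc() {
        if (heap.isEmpty()) return currentDoc = NO_MORE_DOCS;
        currentDoc = heap.peek().docID();
        // Eagerly move every sub-iterator sitting on currentDoc past it.
        while (!heap.isEmpty() && heap.peek().docID() == currentDoc) {
            Sub top = heap.poll();
            if (top.nextDoc() != NO_MORE_DOCS) heap.add(top);
        }
        return currentDoc;
    }
}
```

With sub-iterators over [1, 2, 6] and [2, 4], the first nextDoc() returns 1
while the first sub-iterator is already positioned on 2.
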
>    So we modified DisjunctionSumScorer to behave as we expected. I then
> wrote some test cases and it works well. I also generated some random
> TermScorers and compared the nextDoc(), score() and advance(int) methods
> of the original DisjunctionSumScorer and the modified one. nextDoc() and
> score() are exactly the same. But for advance(int target), we found some
> interesting and strange things.
>    At first, I assumed that if target is less than the current docID,
> advance would just return the current docID and do nothing. This
> assumption made my algorithm go wrong. Then I read the code of TermScorer
> and found that every call to TermScorer.advance(int) calls nextDoc(),
> regardless of whether the current docID is larger than target.
>    So I was confused, and then I read the javadoc of DocIdSetIterator:
> ------- javadoc of DocIdSetIterator.advance(int target) -------
>
> int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws
> IOException
>
> Advances to the first beyond (see NOTE below) the current whose document
> number is greater than or equal
>  to target. Returns the current document number or NO_MORE_DOCS if there
> are no more docs in the set.
> Behaves as if written:
>  int advance(int target) {
>    int doc;
>    while ((doc = nextDoc()) < target) {
>    }
>    return doc;
>  }
>  Some implementations are considerably more efficient than that.
> NOTE: when target < current implementations may opt not to advance beyond
> their current docID().
> NOTE: this method may be called with NO_MORE_DOCS for efficiency by some
> Scorers. If your
>  implementation cannot efficiently determine that it should exhaust, it is
> recommended that you check for
>  that value in each call to this method.
> NOTE: after the iterator has exhausted you should not call this method, as
> it may result in unpredicted
>  behavior.
> --------------------------------------
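
As I read the javadoc quoted above, two iterators can both follow it and
still disagree when target is behind the current doc. Below is a minimal
sketch of that, with hypothetical toy classes (LiteralIterator,
LazyIterator, advanceIfBehind are my own names, not Lucene API) over sorted
int arrays. One follows the "behaves as if written" loop literally, the
other uses the freedom from the first NOTE.

```java
/** Toy iterators over sorted int arrays; hypothetical names, not Lucene classes. */
class AdvanceContractSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Follows the javadoc loop literally: every advance() call steps at least once. */
    static class LiteralIterator {
        private final int[] docs;
        private int pos = -1;

        LiteralIterator(int... docs) { this.docs = docs; }

        int docID() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        int nextDoc() {
            if (pos < docs.length) pos++;
            return docID();
        }

        int advance(int target) {
            int doc;
            while ((doc = nextDoc()) < target) {
                // keep stepping until we reach or pass target
            }
            return doc;
        }
    }

    /** Uses the first NOTE: stays put when target < current docID(). */
    static class LazyIterator extends LiteralIterator {
        LazyIterator(int... docs) { super(docs); }

        @Override
        int advance(int target) {
            return target < docID() ? docID() : super.advance(target);
        }
    }

    /** Caller-side guard: never calls advance() unless the iterator is behind target. */
    static int advanceIfBehind(LiteralIterator it, int target) {
        return target <= it.docID() ? it.docID() : it.advance(target);
    }
}
```

Over docs [1, 2, 6], after advance(2) the literal iterator answers
advance(1) with 6, while the lazy one answers 2. A caller that needs one
predictable result can compare target with docID() itself, as
advanceIfBehind does.
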
> Then I modified my algorithm again and found that
> DisjunctionSumScorer.advance(int target) behaves strangely. In most
> cases it returns currentDoc if target < currentDoc, but in some boundary
> conditions it does not.
> It's not a bug, but it made me sad: I thought my algorithm had a bug
> because its advance method did not behave exactly like the original
> DisjunctionSumScorer's.
> ---- code of DisjunctionSumScorer ----
>   @Override
>   public int advance(int target) throws IOException {
>     if (scorerDocQueue.size() < minimumNrMatchers) {
>       return currentDoc = NO_MORE_DOCS;
>     }
>     if (target <= currentDoc) {
>       return currentDoc;
>     }
>    ....
> -------------------
> In most cases, if (target <= currentDoc) it returns currentDoc;
> but if a previous advance exhausted the sub-scorers, it may
> return NO_MORE_DOCS instead.
> An example:
>    currentDoc=-1
>    minimumNrMatchers=1
>    subScorers:
>       TermScorer: docIds: [1, 2, 6]
>       TermScorer: docIds: [2, 4]
> After the first call, advance(5):
>     currentDoc=6
>     only first scorer is now in the heap, scorerDocQueue.size()==1
> Then call advance(6):
>     because scorerDocQueue.size() < minimumNrMatchers, it just returns
> NO_MORE_DOCS
>
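
The boundary case above can be reproduced with a toy model. This is a
minimal sketch under my own assumptions, not the real DisjunctionSumScorer:
AdvanceBoundarySketch, Sub and queueSize() are hypothetical names, and only
the order of the two guards at the top of advance() is taken from the
quoted code. In this toy model the heap ends up empty after advance(5),
because the matching sub-iterator is stepped past the match.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Toy model of the guard order in the quoted advance(); not Lucene code. */
class AdvanceBoundarySketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical leaf iterator over a sorted int array. */
    static final class Sub {
        private final int[] docs;
        private int pos = -1;

        Sub(int... docs) { this.docs = docs; }

        int docID() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        int nextDoc() {
            if (pos < docs.length) pos++;
            return docID();
        }

        int advance(int target) {
            int doc;
            while ((doc = nextDoc()) < target) { /* keep stepping */ }
            return doc;
        }
    }

    private final PriorityQueue<Sub> heap =
            new PriorityQueue<Sub>(Comparator.comparingInt(Sub::docID));
    private final int minimumNrMatchers;
    private int currentDoc = -1;

    AdvanceBoundarySketch(int minimumNrMatchers, Sub... subs) {
        this.minimumNrMatchers = minimumNrMatchers;
        for (Sub s : subs) {
            if (s.nextDoc() != NO_MORE_DOCS) heap.add(s);
        }
    }

    int queueSize() { return heap.size(); }

    int advance(int target) {
        // Guard 1 (as quoted): too few live sub-iterators means exhausted.
        if (heap.size() < minimumNrMatchers) return currentDoc = NO_MORE_DOCS;
        // Guard 2 (as quoted): only reached if guard 1 did not fire.
        if (target <= currentDoc) return currentDoc;
        // Bring every sub-iterator that is behind target up to it.
        while (!heap.isEmpty() && heap.peek().docID() < target) {
            Sub top = heap.poll();
            if (top.advance(target) != NO_MORE_DOCS) heap.add(top);
        }
        return collectMatch();
    }

    /** Takes the smallest doc with enough matchers and steps those matchers past it. */
    private int collectMatch() {
        while (!heap.isEmpty() && heap.size() >= minimumNrMatchers) {
            currentDoc = heap.peek().docID();
            int matchers = 0;
            while (!heap.isEmpty() && heap.peek().docID() == currentDoc) {
                Sub top = heap.poll();
                matchers++;
                if (top.nextDoc() != NO_MORE_DOCS) heap.add(top);
            }
            if (matchers >= minimumNrMatchers) return currentDoc;
        }
        return currentDoc = NO_MORE_DOCS;
    }
}
```

With minimumNrMatchers=1 and sub-iterators over [1, 2, 6] and [2, 4],
advance(5) returns 6, and a following advance(6) returns NO_MORE_DOCS
because the first guard fires before target is compared with currentDoc.
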
> My question is: why is the advance(int target) method defined like this?
> Is it for efficiency, or for some other reason?
>
>


-- 
Sincerely yours
Mikhail Khludnev
gedel@yandex.ru

<http://www.griddynamics.com>
 <mkhludnev@griddynamics.com>


--047d7b339cf7045b3304bddeeaab--