lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Wang <john.w...@gmail.com>
Subject Re: Proposal: Scorer api change
Date Tue, 08 Jun 2010 17:25:59 GMT
Hi Earwin:

     I am not sure I understand here, e.g. what si the difference between:

     float myscorinCode(){
         computeMyScore(scorer.score());
     }

     and

      float myscorinCode(){
         computeMyScore(scorer.score(scorer.getDocIdSetIterator().docID());
      }

      In the case of BQ, when you get a hit, would you still be able to call
subscorer.score(hit)? Why is the point of iteration important for BQ?

      please elaborate.

Thanks

-John

On Tue, Jun 8, 2010 at 10:10 AM, Earwin Burrfoot <earwin@gmail.com> wrote:

> The problem with your proposal is that, currently, Lucene uses current
> iteration state to compute score.
> I.e. it already knows which of SHOULD BQ clauses matched for current
> doc, so it's easier to calculate the score.
> If you change API to allow scoring arbitrary documents (even those
> that didn't match the query at all), you're opening a can of worms :)
>
> As an alternative, you can try looking at MG4J sources. As far as I
> understand, their scoring is decoupled from matching, just like you
> (and I bet many more people) want. The matcher is separate, and the
> scoring entity accepts current matcher state instead of document id,
> so you get the best of both worlds.
>
> On Tue, Jun 8, 2010 at 21:01, John Wang <john.wang@gmail.com> wrote:
> > re: But Scorer is itself an iterator, so what prevents you from calling
> > nextDoc and advance on it without score()
> >
> > Nothing. It is just inefficient to pay the method call overhead just to
> > overload score.
> >
> > re: If I were in your shoes, I'd simply provider a Query wrapper. If CSQ
> > is not good enough I'd just develop my own.
> >
> > That is what I am doing. I am just proposing the change (see my first
> email)
> > as an improvement.
> >
> > re: Scorer is itself an iterator
> >
> > yes, that is the current definition. The point of the proposal is to make
> > this change.
> >
> > -John
> >
> > On Tue, Jun 8, 2010 at 9:45 AM, Shai Erera <serera@gmail.com> wrote:
> >>
> >> Well ... I don't know the reason as well and always thought Scorer and
> >> Similarity are confusing.
> >>
> >> But Scorer is itself an iterator, so what prevents you from calling
> >> nextDoc and advance on it without score(). And what would the returned
> >> DISI do when nextDoc is called, if not delegate to its subs?
> >>
> >> If I were in your shoes, I'd simply provider a Query wrapper. If CSQ
> >> is not good enough I'd just develop my own.
> >>
> >> But perhaps others think differently?
> >>
> >> Shai
> >>
> >> On Tuesday, June 8, 2010, John Wang <john.wang@gmail.com> wrote:
> >> > Hi Shai:
> >> >     I am not sure I understand how changing Similarity would solve
> this
> >> > problem, wouldn't you need the reader?
> >> >     As for PayloadTermQuery, payload is not always the most efficient
> >> > way of storing such data, especially when number of terms << numdocs.
> (I am
> >> > not sure accessing the payload when you iterate is a good idea, but
> that is
> >> > another discussion)
> >> >
> >> >     Yes, what I described is exactly a simple CustomScoreQuery for a
> >> > special use-case. The problem is also in CustomScoreQuery, where
> nextDoc and
> >> > advance are calling the sub-scorers as a wrapper. This can be avoided
> if the
> >> > Scorer returns an iterator instead.
> >> >
> >> >     Separating scoring and doc iteration is a good idea anyway. I
> don't
> >> > know the reason to combine them originally.
> >> > Thanks
> >> > -John
> >> >
> >> >
> >> > On Tue, Jun 8, 2010 at 8:47 AM, Shai Erera <serera@gmail.com> wrote:
> >> >
> >> > So wouldn't it make sense to add some method to Similarity? Which
> >> > receives the doc Id in question maybe ... just thinking here.
> >> >
> >> > Factoring Scorer like you propose would create 3 objects for
> >> > scoring/iterating: Scorer (which really becomes an iterator),
> Similarity and
> >> > CustomScoreFunction ...
> >> >
> >> > Maybe you can use CustomScoreQuery? or PayloadTermQuery? depends how
> you
> >> > compute your age decay function (where you pull the data about the age
> of
> >> > the document).
> >> >
> >> > Shai
> >> >
> >> >
> >> > On Tue, Jun 8, 2010 at 6:41 PM, John Wang <john.wang@gmail.com>
> wrote:
> >> > Hi Shai:
> >> >     Similarity in many cases is not sufficient for scoring. For
> example,
> >> > to implement age decaying of a document (very useful for corpuses like
> news
> >> > or tweets), you want to project the raw tfidf score onto a time curve,
> say
> >> > f(x), to do this, you'd have a custom scorer that decorates the
> underlying
> >> > scorer from your say, boolean query:
> >> >
> >> >
> >> >
> >> > public float score(){    return myFunc(innerScorer.score());}
> >> >     This is fine, but then you would have to do this as well:
> >> > public int nextDoc(){
> >> >
> >> >
> >> >    return innerScorer.nextDoc();}
> >> > and also:
> >> > public int advance(int target){   return innerScorer.advance();}
> >> > The difference here is that nextDoc and advance are called far more
> times as
> >> > score. And you are introducing an extra method call for them, which is
> not
> >> > insignificant for queries result in large recall sets.
> >> >
> >> >
> >> >
> >> > Hope this makes sense.
> >> > Thanks
> >> > -John
> >> > On Tue, Jun 8, 2010 at 5:02 AM, Shai Erera <serera@gmail.com> wrote:
> >> > I'm not sure I understand what you mean - Scorer is a DISI itself, and
> >> > the scoring formula is mostly controlled by Similarity.
> >> >
> >> > What will be the benefits of the proposed change?
> >> >
> >> > Shai
> >> >
> >> > On Tue, Jun 8, 2010 at 8:25 AM, John Wang <john.wang@gmail.com>
> wrote:
> >> >
> >> >
> >> >
> >> >
> >> > Hi guys:
> >> >
> >> >     I'd like to make a proposal to change the Scorer class/api to the
> >> > following:
> >> >
> >> >
> >> > public abstract class Scorer{
> >> >    DocIdSetIterator getDocIDSetIterator();
> >> >    float score(int docid);
> >> > }
> >> >
> >> > Reasons:
> >> >
> >> > 1) To build a Scorer from an existing Scorer (e.g. that produces raw
> >> > scores from tfidf), one would decorate it, and it would introduce
> overhead
> >> > (in function calls) around nextDoc and advance, even if you just want
> to
> >> > augment the score method which is called much fewer times.
> >> >
> >> > 2) The current contract forces scoring on the currentDoc in the
> >> > underlying iterator. So once you pass "current", you can no longer
> score. In
> >> > one of our use-cases, it is very inconvenient.
> >> >
> >> > What do you think? I can go ahead and open an issue and work on a
> patch
> >> > if I get some agreement.
> >> >
> >> > Thanks
> >> >
> >> > -John
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>
>
>
> --
> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
> Phone: +7 (495) 683-567-4
> ICQ: 104465785
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Mime
View raw message