lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Earwin Burrfoot <ear...@gmail.com>
Subject Re: Proposal: Scorer api change
Date Tue, 08 Jun 2010 17:54:56 GMT
To compute a score you have to see which of your subqueries did not
match, which did, and what are the docfreqs/positions for them.
When iterating, and calling score() only for current doc - parts of
this data (maybe even all of it, not sure) is already gathered for
you. If you allow calling score(int doc) - for arbitrary docId, you'll
have to redo this work.

2010/6/8 John Wang <john.wang@gmail.com>:
> Hi Earwin:
>
>      I am not sure I understand here, e.g. what si the difference between:
>
>      float myscorinCode(){
>          computeMyScore(scorer.score());
>      }
>
>      and
>
>       float myscorinCode(){
>          computeMyScore(scorer.score(scorer.getDocIdSetIterator().docID());
>       }
>
>       In the case of BQ, when you get a hit, would you still be able to call
> subscorer.score(hit)? Why is the point of iteration important for BQ?
>
>       please elaborate.
>
> Thanks
>
> -John
>
> On Tue, Jun 8, 2010 at 10:10 AM, Earwin Burrfoot <earwin@gmail.com> wrote:
>>
>> The problem with your proposal is that, currently, Lucene uses current
>> iteration state to compute score.
>> I.e. it already knows which of SHOULD BQ clauses matched for current
>> doc, so it's easier to calculate the score.
>> If you change API to allow scoring arbitrary documents (even those
>> that didn't match the query at all), you're opening a can of worms :)
>>
>> As an alternative, you can try looking at MG4J sources. As far as I
>> understand, their scoring is decoupled from matching, just like you
>> (and I bet many more people) want. The matcher is separate, and the
>> scoring entity accepts current matcher state instead of document id,
>> so you get the best of both worlds.
>>
>> On Tue, Jun 8, 2010 at 21:01, John Wang <john.wang@gmail.com> wrote:
>> > re: But Scorer is itself an iterator, so what prevents you from calling
>> > nextDoc and advance on it without score()
>> >
>> > Nothing. It is just inefficient to pay the method call overhead just to
>> > overload score.
>> >
>> > re: If I were in your shoes, I'd simply provider a Query wrapper. If CSQ
>> > is not good enough I'd just develop my own.
>> >
>> > That is what I am doing. I am just proposing the change (see my first
>> > email)
>> > as an improvement.
>> >
>> > re: Scorer is itself an iterator
>> >
>> > yes, that is the current definition. The point of the proposal is to
>> > make
>> > this change.
>> >
>> > -John
>> >
>> > On Tue, Jun 8, 2010 at 9:45 AM, Shai Erera <serera@gmail.com> wrote:
>> >>
>> >> Well … I don't know the reason as well and always thought Scorer and
>> >> Similarity are confusing.
>> >>
>> >> But Scorer is itself an iterator, so what prevents you from calling
>> >> nextDoc and advance on it without score(). And what would the returned
>> >> DISI do when nextDoc is called, if not delegate to its subs?
>> >>
>> >> If I were in your shoes, I'd simply provider a Query wrapper. If CSQ
>> >> is not good enough I'd just develop my own.
>> >>
>> >> But perhaps others think differently?
>> >>
>> >> Shai
>> >>
>> >> On Tuesday, June 8, 2010, John Wang <john.wang@gmail.com> wrote:
>> >> > Hi Shai:
>> >> >     I am not sure I understand how changing Similarity would solve
>> >> > this
>> >> > problem, wouldn't you need the reader?
>> >> >     As for PayloadTermQuery, payload is not always the most efficient
>> >> > way of storing such data, especially when number of terms <<
numdocs.
>> >> > (I am
>> >> > not sure accessing the payload when you iterate is a good idea, but
>> >> > that is
>> >> > another discussion)
>> >> >
>> >> >     Yes, what I described is exactly a simple CustomScoreQuery
for a
>> >> > special use-case. The problem is also in CustomScoreQuery, where
>> >> > nextDoc and
>> >> > advance are calling the sub-scorers as a wrapper. This can be avoided
>> >> > if the
>> >> > Scorer returns an iterator instead.
>> >> >
>> >> >     Separating scoring and doc iteration is a good idea anyway.
I
>> >> > don't
>> >> > know the reason to combine them originally.
>> >> > Thanks
>> >> > -John
>> >> >
>> >> >
>> >> > On Tue, Jun 8, 2010 at 8:47 AM, Shai Erera <serera@gmail.com>
wrote:
>> >> >
>> >> > So wouldn't it make sense to add some method to Similarity? Which
>> >> > receives the doc Id in question maybe ... just thinking here.
>> >> >
>> >> > Factoring Scorer like you propose would create 3 objects for
>> >> > scoring/iterating: Scorer (which really becomes an iterator),
>> >> > Similarity and
>> >> > CustomScoreFunction ...
>> >> >
>> >> > Maybe you can use CustomScoreQuery? or PayloadTermQuery? depends how
>> >> > you
>> >> > compute your age decay function (where you pull the data about the
>> >> > age of
>> >> > the document).
>> >> >
>> >> > Shai
>> >> >
>> >> >
>> >> > On Tue, Jun 8, 2010 at 6:41 PM, John Wang <john.wang@gmail.com>
>> >> > wrote:
>> >> > Hi Shai:
>> >> >     Similarity in many cases is not sufficient for scoring. For
>> >> > example,
>> >> > to implement age decaying of a document (very useful for corpuses
>> >> > like news
>> >> > or tweets), you want to project the raw tfidf score onto a time
>> >> > curve, say
>> >> > f(x), to do this, you'd have a custom scorer that decorates the
>> >> > underlying
>> >> > scorer from your say, boolean query:
>> >> >
>> >> >
>> >> >
>> >> > public float score(){    return myFunc(innerScorer.score());}
>> >> >     This is fine, but then you would have to do this as well:
>> >> > public int nextDoc(){
>> >> >
>> >> >
>> >> >    return innerScorer.nextDoc();}
>> >> > and also:
>> >> > public int advance(int target){   return innerScorer.advance();}
>> >> > The difference here is that nextDoc and advance are called far more
>> >> > times as
>> >> > score. And you are introducing an extra method call for them, which
>> >> > is not
>> >> > insignificant for queries result in large recall sets.
>> >> >
>> >> >
>> >> >
>> >> > Hope this makes sense.
>> >> > Thanks
>> >> > -John
>> >> > On Tue, Jun 8, 2010 at 5:02 AM, Shai Erera <serera@gmail.com>
wrote:
>> >> > I'm not sure I understand what you mean - Scorer is a DISI itself,
>> >> > and
>> >> > the scoring formula is mostly controlled by Similarity.
>> >> >
>> >> > What will be the benefits of the proposed change?
>> >> >
>> >> > Shai
>> >> >
>> >> > On Tue, Jun 8, 2010 at 8:25 AM, John Wang <john.wang@gmail.com>
>> >> > wrote:
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > Hi guys:
>> >> >
>> >> >     I'd like to make a proposal to change the Scorer class/api to
the
>> >> > following:
>> >> >
>> >> >
>> >> > public abstract class Scorer{
>> >> >    DocIdSetIterator getDocIDSetIterator();
>> >> >    float score(int docid);
>> >> > }
>> >> >
>> >> > Reasons:
>> >> >
>> >> > 1) To build a Scorer from an existing Scorer (e.g. that produces raw
>> >> > scores from tfidf), one would decorate it, and it would introduce
>> >> > overhead
>> >> > (in function calls) around nextDoc and advance, even if you just want
>> >> > to
>> >> > augment the score method which is called much fewer times.
>> >> >
>> >> > 2) The current contract forces scoring on the currentDoc in the
>> >> > underlying iterator. So once you pass "current", you can no longer
>> >> > score. In
>> >> > one of our use-cases, it is very inconvenient.
>> >> >
>> >> > What do you think? I can go ahead and open an issue and work on a
>> >> > patch
>> >> > if I get some agreement.
>> >> >
>> >> > Thanks
>> >> >
>> >> > -John
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >
>> >
>>
>>
>>
>> --
>> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
>> Phone: +7 (495) 683-567-4
>> ICQ: 104465785
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message