lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shai Erera <ser...@gmail.com>
Subject Re: Proposal: Scorer api change
Date Tue, 08 Jun 2010 17:28:47 GMT
What do you mean "we are not inlining"? The compiler inlines methods .. at
least it tries.

Shai

On Tue, Jun 8, 2010 at 8:21 PM, John Wang <john.wang@gmail.com> wrote:

> Shai:
>
>     method call overhead in this case is not insignificant because it is in
> a very tight loop, and no, compiler cannot optimize it for you, we are not
> inline-ing cuz we are in a java world.
>
>      You are right, this breaks backward compatibility. But from 2.4 - 2.9,
> we have done MUCH worse. :)
>
> -John
>
>
> On Tue, Jun 8, 2010 at 10:09 AM, Shai Erera <serera@gmail.com> wrote:
>
>> I guess I must be missing something fundamental here :).
>>
>> If Scorer is defined as you propose, and I create my Scorer which impls
>> getDISI() as "return this" - what do I lose? What's wrong w/ Scorer already
>> being a DISI?
>>
>> You mention "it is just inefficient to pay the method call overhead ..." -
>> what overhead? Are you talking about the decorator delegating the call to
>> the wrapped scorer? I really think the compiler can handle that, no?
>> Especially if you make your nextDoc/advance final (which probably you
>> should) ...
>>
>> That doesn't seem to justify an API change, break bw completely (even if
>> we do it in 4.0 only) and change all the current Scorers ...
>>
>> Shai
>>
>>
>> On Tue, Jun 8, 2010 at 8:01 PM, John Wang <john.wang@gmail.com> wrote:
>>
>>> re: But Scorer is itself an iterator, so what prevents you from calling
>>>
>>> nextDoc and advance on it without score()
>>>
>>> Nothing. It is just inefficient to pay the method call overhead just to
>>> overload score.
>>>
>>> re: If I were in your shoes, I'd simply provider a Query wrapper. If CSQ
>>>
>>> is not good enough I'd just develop my own.
>>>
>>> That is what I am doing. I am just proposing the change (see my first
>>> email) as an improvement.
>>>
>>> re: Scorer is itself an iterator
>>>
>>> yes, that is the current definition. The point of the proposal is to make
>>> this change.
>>>
>>> -John
>>>
>>> On Tue, Jun 8, 2010 at 9:45 AM, Shai Erera <serera@gmail.com> wrote:
>>>
>>>> Well … I don't know the reason as well and always thought Scorer and
>>>> Similarity are confusing.
>>>>
>>>> But Scorer is itself an iterator, so what prevents you from calling
>>>> nextDoc and advance on it without score(). And what would the returned
>>>> DISI do when nextDoc is called, if not delegate to its subs?
>>>>
>>>> If I were in your shoes, I'd simply provider a Query wrapper. If CSQ
>>>> is not good enough I'd just develop my own.
>>>>
>>>> But perhaps others think differently?
>>>>
>>>> Shai
>>>>
>>>> On Tuesday, June 8, 2010, John Wang <john.wang@gmail.com> wrote:
>>>> > Hi Shai:
>>>> >     I am not sure I understand how changing Similarity would solve
>>>> this problem, wouldn't you need the reader?
>>>> >     As for PayloadTermQuery, payload is not always the most efficient
>>>> way of storing such data, especially when number of terms << numdocs.
(I am
>>>> not sure accessing the payload when you iterate is a good idea, but that
is
>>>> another discussion)
>>>> >
>>>> >     Yes, what I described is exactly a simple CustomScoreQuery for a
>>>> special use-case. The problem is also in CustomScoreQuery, where nextDoc
and
>>>> advance are calling the sub-scorers as a wrapper. This can be avoided if
the
>>>> Scorer returns an iterator instead.
>>>> >
>>>> >     Separating scoring and doc iteration is a good idea anyway. I
>>>> don't know the reason to combine them originally.
>>>> > Thanks
>>>> > -John
>>>> >
>>>> >
>>>> > On Tue, Jun 8, 2010 at 8:47 AM, Shai Erera <serera@gmail.com>
wrote:
>>>> >
>>>> > So wouldn't it make sense to add some method to Similarity? Which
>>>> receives the doc Id in question maybe ... just thinking here.
>>>> >
>>>> > Factoring Scorer like you propose would create 3 objects for
>>>> scoring/iterating: Scorer (which really becomes an iterator), Similarity
and
>>>> CustomScoreFunction ...
>>>> >
>>>> > Maybe you can use CustomScoreQuery? or PayloadTermQuery? depends how
>>>> you compute your age decay function (where you pull the data about the age
>>>> of the document).
>>>> >
>>>> > Shai
>>>> >
>>>> >
>>>> > On Tue, Jun 8, 2010 at 6:41 PM, John Wang <john.wang@gmail.com>
>>>> wrote:
>>>> > Hi Shai:
>>>> >     Similarity in many cases is not sufficient for scoring. For
>>>> example, to implement age decaying of a document (very useful for corpuses
>>>> like news or tweets), you want to project the raw tfidf score onto a time
>>>> curve, say f(x), to do this, you'd have a custom scorer that decorates the
>>>> underlying scorer from your say, boolean query:
>>>> >
>>>> >
>>>> >
>>>> > public float score(){    return myFunc(innerScorer.score());}
>>>> >     This is fine, but then you would have to do this as well:
>>>> > public int nextDoc(){
>>>> >
>>>> >
>>>> >    return innerScorer.nextDoc();}
>>>> > and also:
>>>> > public int advance(int target){   return innerScorer.advance();}
>>>> The difference here is that nextDoc and advance are called far more times
as
>>>> score. And you are introducing an extra method call for them, which is not
>>>> insignificant for queries result in large recall sets.
>>>> >
>>>> >
>>>> >
>>>> > Hope this makes sense.
>>>> > Thanks
>>>> > -John
>>>> > On Tue, Jun 8, 2010 at 5:02 AM, Shai Erera <serera@gmail.com>
wrote:
>>>> > I'm not sure I understand what you mean - Scorer is a DISI itself, and
>>>> the scoring formula is mostly controlled by Similarity.
>>>> >
>>>> > What will be the benefits of the proposed change?
>>>> >
>>>> > Shai
>>>> >
>>>> > On Tue, Jun 8, 2010 at 8:25 AM, John Wang <john.wang@gmail.com>
>>>> wrote:
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > Hi guys:
>>>> >
>>>> >     I'd like to make a proposal to change the Scorer class/api to the
>>>> following:
>>>> >
>>>> >
>>>> > public abstract class Scorer{
>>>> >    DocIdSetIterator getDocIDSetIterator();
>>>> >    float score(int docid);
>>>> > }
>>>> >
>>>> > Reasons:
>>>> >
>>>> > 1) To build a Scorer from an existing Scorer (e.g. that produces raw
>>>> scores from tfidf), one would decorate it, and it would introduce overhead
>>>> (in function calls) around nextDoc and advance, even if you just want to
>>>> augment the score method which is called much fewer times.
>>>> >
>>>> > 2) The current contract forces scoring on the currentDoc in the
>>>> underlying iterator. So once you pass "current", you can no longer score.
In
>>>> one of our use-cases, it is very inconvenient.
>>>> >
>>>> > What do you think? I can go ahead and open an issue and work on a
>>>> patch if I get some agreement.
>>>> >
>>>> > Thanks
>>>> >
>>>> > -John
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>
>>>>
>>>
>>
>

Mime
View raw message