lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Wang <john.w...@gmail.com>
Subject Re: Proposal: Scorer api change
Date Tue, 08 Jun 2010 17:44:59 GMT
Shai:

   Java cannot inline in this case.

   Actually there is an urban legend around using final to hint to
underlying compiler to inline :) (turns out to be false, one reason being
dynamic classloading)

   write a simple pgm and try and see for yourself (remember to turn on
-server on VM options)

-John

On Tue, Jun 8, 2010 at 10:28 AM, Shai Erera <serera@gmail.com> wrote:

> What do you mean "we are not inlining"? The compiler inlines methods .. at
> least it tries.
>
> Shai
>
>
> On Tue, Jun 8, 2010 at 8:21 PM, John Wang <john.wang@gmail.com> wrote:
>
>> Shai:
>>
>>     method call overhead in this case is not insignificant because it is
>> in a very tight loop, and no, compiler cannot optimize it for you, we are
>> not inline-ing cuz we are in a java world.
>>
>>      You are right, this breaks backward compatibility. But from 2.4 -
>> 2.9, we have done MUCH worse. :)
>>
>> -John
>>
>>
>> On Tue, Jun 8, 2010 at 10:09 AM, Shai Erera <serera@gmail.com> wrote:
>>
>>> I guess I must be missing something fundamental here :).
>>>
>>> If Scorer is defined as you propose, and I create my Scorer which impls
>>> getDISI() as "return this" - what do I lose? What's wrong w/ Scorer already
>>> being a DISI?
>>>
>>> You mention "it is just inefficient to pay the method call overhead ..."
>>> - what overhead? Are you talking about the decorator delegating the call to
>>> the wrapped scorer? I really think the compiler can handle that, no?
>>> Especially if you make your nextDoc/advance final (which probably you
>>> should) ...
>>>
>>> That doesn't seem to justify an API change, break bw completely (even if
>>> we do it in 4.0 only) and change all the current Scorers ...
>>>
>>> Shai
>>>
>>>
>>> On Tue, Jun 8, 2010 at 8:01 PM, John Wang <john.wang@gmail.com> wrote:
>>>
>>>> re: But Scorer is itself an iterator, so what prevents you from calling
>>>>
>>>> nextDoc and advance on it without score()
>>>>
>>>> Nothing. It is just inefficient to pay the method call overhead just to
>>>> overload score.
>>>>
>>>> re: If I were in your shoes, I'd simply provider a Query wrapper. If CSQ
>>>>
>>>> is not good enough I'd just develop my own.
>>>>
>>>> That is what I am doing. I am just proposing the change (see my first
>>>> email) as an improvement.
>>>>
>>>> re: Scorer is itself an iterator
>>>>
>>>> yes, that is the current definition. The point of the proposal is to
>>>> make this change.
>>>>
>>>> -John
>>>>
>>>> On Tue, Jun 8, 2010 at 9:45 AM, Shai Erera <serera@gmail.com> wrote:
>>>>
>>>>> Well … I don't know the reason as well and always thought Scorer and
>>>>> Similarity are confusing.
>>>>>
>>>>> But Scorer is itself an iterator, so what prevents you from calling
>>>>> nextDoc and advance on it without score(). And what would the returned
>>>>> DISI do when nextDoc is called, if not delegate to its subs?
>>>>>
>>>>> If I were in your shoes, I'd simply provider a Query wrapper. If CSQ
>>>>> is not good enough I'd just develop my own.
>>>>>
>>>>> But perhaps others think differently?
>>>>>
>>>>> Shai
>>>>>
>>>>> On Tuesday, June 8, 2010, John Wang <john.wang@gmail.com> wrote:
>>>>> > Hi Shai:
>>>>> >     I am not sure I understand how changing Similarity would solve
>>>>> this problem, wouldn't you need the reader?
>>>>> >     As for PayloadTermQuery, payload is not always the most efficient
>>>>> way of storing such data, especially when number of terms << numdocs.
(I am
>>>>> not sure accessing the payload when you iterate is a good idea, but that
is
>>>>> another discussion)
>>>>> >
>>>>> >     Yes, what I described is exactly a simple CustomScoreQuery for
a
>>>>> special use-case. The problem is also in CustomScoreQuery, where nextDoc
and
>>>>> advance are calling the sub-scorers as a wrapper. This can be avoided
if the
>>>>> Scorer returns an iterator instead.
>>>>> >
>>>>> >     Separating scoring and doc iteration is a good idea anyway.
I
>>>>> don't know the reason to combine them originally.
>>>>> > Thanks
>>>>> > -John
>>>>> >
>>>>> >
>>>>> > On Tue, Jun 8, 2010 at 8:47 AM, Shai Erera <serera@gmail.com>
wrote:
>>>>> >
>>>>> > So wouldn't it make sense to add some method to Similarity? Which
>>>>> receives the doc Id in question maybe ... just thinking here.
>>>>> >
>>>>> > Factoring Scorer like you propose would create 3 objects for
>>>>> scoring/iterating: Scorer (which really becomes an iterator), Similarity
and
>>>>> CustomScoreFunction ...
>>>>> >
>>>>> > Maybe you can use CustomScoreQuery? or PayloadTermQuery? depends
how
>>>>> you compute your age decay function (where you pull the data about the
age
>>>>> of the document).
>>>>> >
>>>>> > Shai
>>>>> >
>>>>> >
>>>>> > On Tue, Jun 8, 2010 at 6:41 PM, John Wang <john.wang@gmail.com>
>>>>> wrote:
>>>>> > Hi Shai:
>>>>> >     Similarity in many cases is not sufficient for scoring. For
>>>>> example, to implement age decaying of a document (very useful for corpuses
>>>>> like news or tweets), you want to project the raw tfidf score onto a
time
>>>>> curve, say f(x), to do this, you'd have a custom scorer that decorates
the
>>>>> underlying scorer from your say, boolean query:
>>>>> >
>>>>> >
>>>>> >
>>>>> > public float score(){    return myFunc(innerScorer.score());}
>>>>> >     This is fine, but then you would have to do this as well:
>>>>> > public int nextDoc(){
>>>>> >
>>>>> >
>>>>> >    return innerScorer.nextDoc();}
>>>>> > and also:
>>>>> > public int advance(int target){   return innerScorer.advance();}
>>>>> The difference here is that nextDoc and advance are called far more times
as
>>>>> score. And you are introducing an extra method call for them, which is
not
>>>>> insignificant for queries result in large recall sets.
>>>>> >
>>>>> >
>>>>> >
>>>>> > Hope this makes sense.
>>>>> > Thanks
>>>>> > -John
>>>>> > On Tue, Jun 8, 2010 at 5:02 AM, Shai Erera <serera@gmail.com>
wrote:
>>>>> > I'm not sure I understand what you mean - Scorer is a DISI itself,
>>>>> and the scoring formula is mostly controlled by Similarity.
>>>>> >
>>>>> > What will be the benefits of the proposed change?
>>>>> >
>>>>> > Shai
>>>>> >
>>>>> > On Tue, Jun 8, 2010 at 8:25 AM, John Wang <john.wang@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > Hi guys:
>>>>> >
>>>>> >     I'd like to make a proposal to change the Scorer class/api to
the
>>>>> following:
>>>>> >
>>>>> >
>>>>> > public abstract class Scorer{
>>>>> >    DocIdSetIterator getDocIDSetIterator();
>>>>> >    float score(int docid);
>>>>> > }
>>>>> >
>>>>> > Reasons:
>>>>> >
>>>>> > 1) To build a Scorer from an existing Scorer (e.g. that produces
raw
>>>>> scores from tfidf), one would decorate it, and it would introduce overhead
>>>>> (in function calls) around nextDoc and advance, even if you just want
to
>>>>> augment the score method which is called much fewer times.
>>>>> >
>>>>> > 2) The current contract forces scoring on the currentDoc in the
>>>>> underlying iterator. So once you pass "current", you can no longer score.
In
>>>>> one of our use-cases, it is very inconvenient.
>>>>> >
>>>>> > What do you think? I can go ahead and open an issue and work on
a
>>>>> patch if I get some agreement.
>>>>> >
>>>>> > Thanks
>>>>> >
>>>>> > -John
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message