lucene-dev mailing list archives

From John Wang <john.w...@gmail.com>
Subject Re: Proposal: Scorer api change
Date Tue, 08 Jun 2010 22:02:06 GMT
Hi Doron:

Re: "comparing to all other IO ops and computations done by the stack of
scorers"

Lucene caches and compresses well enough that the IO cache is effective,
so most of the time you are not really paying for disk movement. As for the
stack of scorers, that is actually my point: score is called far fewer
times than nextDoc/advance (sorry to keep reiterating this).

The overhead of delegation is not insignificant for large result sets,
because the delegated calls sit in an inner/tight loop (again, sorry to
have to reiterate this).
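
To make the delegation point concrete, here is a minimal, self-contained
sketch. It does not use Lucene itself: DocIterator, CoupledScorer,
SplitScorer, the Base*/Decay* classes and decay() are all hypothetical
stand-ins I made up for illustration, and the 0.5 decay factor is arbitrary.
It only contrasts a decorator that must forward nextDoc/advance with one
that hands back the inner iterator untouched.

```java
public class ScorerSplitSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Hypothetical stand-in for a doc id iterator (not the real Lucene class).
    interface DocIterator {
        int nextDoc();
        int advance(int target);
    }

    // Current shape: the scorer is itself the iterator and scores the current doc.
    interface CoupledScorer extends DocIterator {
        float score();
    }

    // Proposed shape: the scorer hands out its iterator and scores by docid.
    interface SplitScorer {
        DocIterator iterator();
        float score(int docid);
    }

    // Walks a sorted array of docids.
    static class ArrayIterator implements DocIterator {
        private final int[] docs;
        private int idx = -1;
        ArrayIterator(int[] docs) { this.docs = docs; }
        public int nextDoc() {
            if (idx < docs.length) idx++;
            return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
        }
        public int advance(int target) {
            int doc;
            do { doc = nextDoc(); } while (doc < target);
            return doc;
        }
    }

    // Inner scorers that give every doc a raw score of 1.0 (illustrative only).
    static class BaseCoupled extends ArrayIterator implements CoupledScorer {
        BaseCoupled(int[] docs) { super(docs); }
        public float score() { return 1.0f; }
    }

    static class BaseSplit implements SplitScorer {
        private final DocIterator it;
        BaseSplit(int[] docs) { it = new ArrayIterator(docs); }
        public DocIterator iterator() { return it; }
        public float score(int docid) { return 1.0f; }
    }

    // Arbitrary placeholder for an age-decay function.
    static float decay(float raw) { return raw * 0.5f; }

    // Decorator under the coupled shape: every iteration call is forwarded.
    static class DecayCoupled implements CoupledScorer {
        private final CoupledScorer inner;
        int forwarded = 0; // counts the extra delegated calls
        DecayCoupled(CoupledScorer inner) { this.inner = inner; }
        public int nextDoc() { forwarded++; return inner.nextDoc(); }
        public int advance(int target) { forwarded++; return inner.advance(target); }
        public float score() { return decay(inner.score()); }
    }

    // Decorator under the split shape: only score is wrapped.
    static class DecaySplit implements SplitScorer {
        private final SplitScorer inner;
        DecaySplit(SplitScorer inner) { this.inner = inner; }
        public DocIterator iterator() { return inner.iterator(); } // no wrapper
        public float score(int docid) { return decay(inner.score(docid)); }
    }

    public static void main(String[] args) {
        int[] docs = {1, 4, 9, 16};

        DecayCoupled coupled = new DecayCoupled(new BaseCoupled(docs));
        while (coupled.nextDoc() != NO_MORE_DOCS) { /* iterate without scoring */ }
        System.out.println("forwarded iteration calls: " + coupled.forwarded);

        SplitScorer split = new DecaySplit(new BaseSplit(docs));
        DocIterator it = split.iterator(); // the inner iterator itself
        int doc = it.advance(5);
        System.out.println("doc " + doc + " score " + split.score(doc));
    }
}
```

In the coupled case every nextDoc/advance pays one extra hop through the
decorator, even when score is never called. In the split case the caller
drives the inner iterator directly and only score goes through the
decorator. Whether that extra hop is measurable in practice depends on the
JIT and the query, so treat this as an illustration of the call structure,
not a benchmark.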

Subclassing works fine, but it is not viable when you don't know in advance
which query type you will have to subclass.

-John


2010/6/8 Doron Cohen <cdoronc@gmail.com>

> I too tend to ignore the overhead of delegated calls, especially compared
> to all the other IO ops and computations done by the stack of scorers. But
> accepting that you cannot ignore it, could you achieve the same goal by
> sub-classing the top query, so that its weight returns a sub-class of its
> scorer which overrides only score() and applies, e.g., the decay logic
> there? This way no delegation is required for the other methods. A
> disadvantage is that you would need to subclass like this every kind of
> top-level query that might come up in your app, so I am not sure this is
> really acceptable in your case. Another disadvantage is that this code is
> much more complicated to write.
>
> Doron
>
>
> 2010/6/8 John Wang <john.wang@gmail.com>
>
>> Wouldn't you get it as well with the proposed api?
>> You would still be able to iterate the docs and at that point call score
>> with the docid. If you call score() along with iteration, you would still
>> get the information, no?
>> Making the scorer take a docid allows you to score any docid in the reader
>> if the query wants to. Wouldn't that make it more flexible?
>>
>> -John
>>
>>
>> On Tue, Jun 8, 2010 at 10:54 AM, Earwin Burrfoot <earwin@gmail.com> wrote:
>>
>>> To compute a score you have to see which of your subqueries did not
>>> match, which did, and what are the docfreqs/positions for them.
>>> When iterating, and calling score() only for current doc - parts of
>>> this data (maybe even all of it, not sure) is already gathered for
>>> you. If you allow calling score(int doc) - for arbitrary docId, you'll
>>> have to redo this work.
>>>
>>> 2010/6/8 John Wang <john.wang@gmail.com>:
>>> > Hi Earwin:
>>> >
>>> >      I am not sure I understand here, e.g. what is the difference
>>> > between:
>>> >
>>> >      float myScoringCode() {
>>> >          return computeMyScore(scorer.score());
>>> >      }
>>> >
>>> >      and
>>> >
>>> >      float myScoringCode() {
>>> >          return computeMyScore(
>>> >              scorer.score(scorer.getDocIdSetIterator().docID()));
>>> >      }
>>> >
>>> >      In the case of BQ, when you get a hit, would you still be able to
>>> > call subscorer.score(hit)? Why is the point of iteration important for BQ?
>>> >
>>> >      Please elaborate.
>>> >
>>> > Thanks
>>> >
>>> > -John
>>> >
>>> > On Tue, Jun 8, 2010 at 10:10 AM, Earwin Burrfoot <earwin@gmail.com> wrote:
>>> >>
>>> >> The problem with your proposal is that, currently, Lucene uses current
>>> >> iteration state to compute score.
>>> >> I.e. it already knows which of SHOULD BQ clauses matched for current
>>> >> doc, so it's easier to calculate the score.
>>> >> If you change API to allow scoring arbitrary documents (even those
>>> >> that didn't match the query at all), you're opening a can of worms :)
>>> >>
>>> >> As an alternative, you can try looking at MG4J sources. As far as I
>>> >> understand, their scoring is decoupled from matching, just like you
>>> >> (and I bet many more people) want. The matcher is separate, and the
>>> >> scoring entity accepts current matcher state instead of document id,
>>> >> so you get the best of both worlds.
>>> >>
>>> >> On Tue, Jun 8, 2010 at 21:01, John Wang <john.wang@gmail.com> wrote:
>>> >> > re: But Scorer is itself an iterator, so what prevents you from
>>> >> > calling nextDoc and advance on it without score()
>>> >> >
>>> >> > Nothing. It is just inefficient to pay the method call overhead just
>>> >> > to override score.
>>> >> >
>>> >> > re: If I were in your shoes, I'd simply provide a Query wrapper. If
>>> >> > CSQ is not good enough I'd just develop my own.
>>> >> >
>>> >> > That is what I am doing. I am just proposing the change (see my first
>>> >> > email) as an improvement.
>>> >> >
>>> >> > re: Scorer is itself an iterator
>>> >> >
>>> >> > Yes, that is the current definition. The point of the proposal is to
>>> >> > change it.
>>> >> >
>>> >> > -John
>>> >> >
>>> >> > On Tue, Jun 8, 2010 at 9:45 AM, Shai Erera <serera@gmail.com> wrote:
>>> >> >>
>>> >> >> Well ... I don't know the reason either, and I always thought Scorer
>>> >> >> and Similarity were confusing.
>>> >> >>
>>> >> >> But Scorer is itself an iterator, so what prevents you from calling
>>> >> >> nextDoc and advance on it without score()? And what would the returned
>>> >> >> DISI do when nextDoc is called, if not delegate to its subs?
>>> >> >>
>>> >> >> If I were in your shoes, I'd simply provide a Query wrapper. If CSQ
>>> >> >> is not good enough I'd just develop my own.
>>> >> >>
>>> >> >> But perhaps others think differently?
>>> >> >>
>>> >> >> Shai
>>> >> >>
>>> >> >> On Tuesday, June 8, 2010, John Wang <john.wang@gmail.com> wrote:
>>> >> >> > Hi Shai:
>>> >> >> >     I am not sure I understand how changing Similarity would solve
>>> >> >> > this problem. Wouldn't you need the reader?
>>> >> >> >     As for PayloadTermQuery, a payload is not always the most
>>> >> >> > efficient way of storing such data, especially when the number of
>>> >> >> > terms << numdocs. (I am not sure accessing the payload when you
>>> >> >> > iterate is a good idea, but that is another discussion.)
>>> >> >> >
>>> >> >> >     Yes, what I described is exactly a simple CustomScoreQuery for a
>>> >> >> > special use-case. The problem is also in CustomScoreQuery, where
>>> >> >> > nextDoc and advance call the sub-scorers through a wrapper. This can
>>> >> >> > be avoided if the Scorer returns an iterator instead.
>>> >> >> >
>>> >> >> >     Separating scoring and doc iteration is a good idea anyway. I
>>> >> >> > don't know why they were combined originally.
>>> >> >> > Thanks
>>> >> >> > -John
>>> >> >> >
>>> >> >> >
>>> >> >> > On Tue, Jun 8, 2010 at 8:47 AM, Shai Erera <serera@gmail.com> wrote:
>>> >> >> >
>>> >> >> > So wouldn't it make sense to add some method to Similarity, which
>>> >> >> > receives the doc Id in question maybe ... just thinking here.
>>> >> >> >
>>> >> >> > Factoring Scorer like you propose would create 3 objects for
>>> >> >> > scoring/iterating: Scorer (which really becomes an iterator),
>>> >> >> > Similarity and CustomScoreFunction ...
>>> >> >> >
>>> >> >> > Maybe you can use CustomScoreQuery? Or PayloadTermQuery? It depends
>>> >> >> > on how you compute your age decay function (where you pull the data
>>> >> >> > about the age of the document).
>>> >> >> >
>>> >> >> > Shai
>>> >> >> >
>>> >> >> >
>>> >> >> > On Tue, Jun 8, 2010 at 6:41 PM, John Wang <john.wang@gmail.com> wrote:
>>> >> >> > Hi Shai:
>>> >> >> >     Similarity in many cases is not sufficient for scoring. For
>>> >> >> > example, to implement age decay of a document (very useful for
>>> >> >> > corpuses like news or tweets), you want to project the raw tfidf
>>> >> >> > score onto a time curve, say f(x). To do this, you'd have a custom
>>> >> >> > scorer that decorates the underlying scorer from, say, your boolean
>>> >> >> > query:
>>> >> >> >
>>> >> >> > public float score() {
>>> >> >> >     return myFunc(innerScorer.score());
>>> >> >> > }
>>> >> >> >
>>> >> >> >     This is fine, but then you would have to do this as well:
>>> >> >> >
>>> >> >> > public int nextDoc() {
>>> >> >> >     return innerScorer.nextDoc();
>>> >> >> > }
>>> >> >> >
>>> >> >> > and also:
>>> >> >> >
>>> >> >> > public int advance(int target) {
>>> >> >> >     return innerScorer.advance(target);
>>> >> >> > }
>>> >> >> >
>>> >> >> > The difference here is that nextDoc and advance are called far more
>>> >> >> > often than score. And you are introducing an extra method call for
>>> >> >> > them, which is not insignificant for queries that result in large
>>> >> >> > recall sets.
>>> >> >> >
>>> >> >> > Hope this makes sense.
>>> >> >> > Thanks
>>> >> >> > -John
>>> >> >> > On Tue, Jun 8, 2010 at 5:02 AM, Shai Erera <serera@gmail.com> wrote:
>>> >> >> > I'm not sure I understand what you mean - Scorer is a DISI itself,
>>> >> >> > and the scoring formula is mostly controlled by Similarity.
>>> >> >> >
>>> >> >> > What will be the benefits of the proposed change?
>>> >> >> >
>>> >> >> > Shai
>>> >> >> >
>>> >> >> > On Tue, Jun 8, 2010 at 8:25 AM, John Wang <john.wang@gmail.com> wrote:
>>> >> >> >
>>> >> >> > Hi guys:
>>> >> >> >
>>> >> >> >     I'd like to make a proposal to change the Scorer class/api to the
>>> >> >> > following:
>>> >> >> >
>>> >> >> > public abstract class Scorer {
>>> >> >> >    public abstract DocIdSetIterator getDocIDSetIterator();
>>> >> >> >    public abstract float score(int docid);
>>> >> >> > }
>>> >> >> >
>>> >> >> > Reasons:
>>> >> >> >
>>> >> >> > 1) To build a Scorer from an existing Scorer (e.g. one that produces
>>> >> >> > raw scores from tfidf), one would decorate it, which introduces
>>> >> >> > overhead (in function calls) around nextDoc and advance, even if you
>>> >> >> > just want to augment the score method, which is called far fewer
>>> >> >> > times.
>>> >> >> >
>>> >> >> > 2) The current contract forces scoring on the current doc in the
>>> >> >> > underlying iterator. So once you pass "current", you can no longer
>>> >> >> > score it. In one of our use-cases, this is very inconvenient.
>>> >> >> >
>>> >> >> > What do you think? I can go ahead and open an issue and work on a
>>> >> >> > patch if I get some agreement.
>>> >> >> >
>>> >> >> > Thanks
>>> >> >> >
>>> >> >> > -John
>>> >> >> >
>>> >> >>
>>> >> >>
>>> >> >> ---------------------------------------------------------------------
>>> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >> >>
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
>>> >> Phone: +7 (495) 683-567-4
>>> >> ICQ: 104465785
>>> >>
>>> >>
>>> >
>>> >
>>>
>>>
>>>
>>>
>>>
>>
>
