lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doron Cohen <cdor...@gmail.com>
Subject Re: Proposal: Scorer api change
Date Wed, 09 Jun 2010 07:01:33 GMT
Hi John,

I think there are two aspects to the modified API suggestion:

(1) allow custom scoring with less delegation calls overhead.
(2) support arbitrary doc scoring with each scorer (nicely put by Earwin)
allowing scoring docs also not in docid order.

Point (2) is nice feature. How largely is it needed? Could you elaborate on
the use case where you need to score docs not in docid order?

Point (1) is a performance issue, and so some numbers would help to realize
how important it is. This can be measured by creating a NeutralCustomQuery
(NCQ) over a delegee query Q, such that NCQ just returns the score Q. Now
the performance of these two queries can be measured and compared - they do
the same thing, except that all NCQ's methods delegate to that of Q. Could
you try on an index sufficiently large/small to demonstrate the delegation
penalty and share the results here?

Regards,
Doron

2010/6/9 John Wang <john.wang@gmail.com>

> re: "The compiler does make optimizations and inlines such code/calls if
> it can"
>
> Are you really sure of this in THIS case? Can you elaborate WHEN such
> inline optimizations happen and how it applies here?
>
> This sounds to me like a very vague and irresponsible statement.
>
> Many java literature do not support this statement. There are reasons why
> in this case this cannot be inlined.
>
> re: method call overhead:
>
> This is a small penalty for each hit candidate, which means possiblly
> millions of times per given query, it ADDS UP! This is why some core
> committers like Yonik/Mike code tunes on the level for removing branching
> and assignments.
>
> Whether the method call overhead is insignificant or not is subjective.
> e.g. if your corpus size is small, or expected recall size is small, it is
> not a big deal, and it IS insignificant. But not true in general.
>
> Arguing one way or other as a general statement without support can seem
> irresponsible and pompous. It does not matter what you or I think if it is
> an overhead or not, I would try to keep the discussion on the technical
> aspects.
>
> B.T.W.: Have you done much performance work before?
>
> "code that will 99.99999999% of the times won't be called at all" <- Are
> you saying you know how Lucene is used in all cases? Did you take a poll or
> something? ;)
>
> I would gladly accept this being too much of an api change for now, and the
> point Earwin raised is quite valid. This was meant to be a proposal to get
> some constructive discussions going. Clearly it is not happening...
>
> -John
>
>
> 2010/6/8 Shai Erera <serera@gmail.com>
>
> I agree w/ both Doron and Earwin, on different points though.
>>
>> I don't think the method call is an overhead John. You don't need to
>> reiterate it. The compiler does make optimizations and inlines such
>> code/calls if it can. More than that, the query processing involves so much
>> method calls, that I do think that's insignificant.
>>
>> And my last sentence is related to Earwin's thoughts -- it's insignificant
>> compared to writing complicated code which needs to support arbitrary doc
>> scoring - code that will 99.99999999% of the times won't be called at all,
>> yet we'll need to go through hoola hoops to support it in all Lucene Scorers
>> and custom ones (outside Lucene). So whatever overhead you claim the
>> delegate call adds to the search, I think it's insignificant if in order to
>> avoid it we need to break the API and introduce more complicated code.
>>
>> One complication is preparing for a case where someone asks to score a doc
>> id N, when all Scorers are at doc id N+1 (or further). How do you do that?
>> restart? What if the Scorers don't agree on doc id N at all? What will be
>> the score then?
>>
>> When you ask Scorer.score(), you know it does so on the current doc id,
>> which all sub scorers agreed on. It is a 'match'. I like Earwin's analogy to
>> matcher (though I'm not familiar with MG4J) -- it's like the Java Matcher.
>> You ask a Pattern to match something, and receive a Matcher, on which you
>> can do what you want. You don't ask the Pattern to get you the next match,
>> and then ask it again to do something with it ... not sure if my analogy is
>> 1:1 with the proposed change, but I feel that the proposed change decouples
>> matching and scoring unnecessarily.
>>
>> Shai
>>
>>
>> On Wed, Jun 9, 2010 at 1:50 AM, Earwin Burrfoot <earwin@gmail.com> wrote:
>>
>>> Some people don't do IO while searching at all. When you're over
>>> certain qps/index size threshold, you need less nodes to keep all your
>>> index (or its hot parts) in memory, than to keep combined IO subsystem
>>> throughput high enough to satisfy disc-based search demands.
>>>
>>> 2010/6/9 Doron Cohen <cdoronc@gmail.com>:
>>> > I too tend to ignore the overhead of delegated calls, especially
>>> comparing
>>> > to all other IO ops and computations done by the stack of scorers, but
>>> > accepting that you cannot ignore it, could you achieve the same goal by
>>> > sub-classing the top query where you subclass its weight to return a
>>> > sub-class of its scorer which would only override score() but not the
>>> other
>>> > methods, and in score would apply that eg decay logic? This way no
>>> > delegation is required for the other methods. A disadvantage of this is
>>> that
>>> > you would need subclass like this any kind of top level query that
>>> might
>>> > come up in your app - so not sure if this is really acceptable in your
>>> case.
>>> > Another disadvantage is that this is a much more complicated code to
>>> write.
>>> >
>>> > Doron
>>> >
>>> > 2010/6/8 John Wang <john.wang@gmail.com>
>>> >>
>>> >> Wouldn't you get it as well with proposed api?
>>> >> You would still be able to iterate the doc and at that point call
>>> score
>>> >> with the docid. If you call score() along with iteration, you would
>>> still
>>> >> get the information no?
>>> >> Making scorer take a docid allows you score any docid in the reader
if
>>> the
>>> >> query wants it to. Wouldn't it make it more flexible?
>>> >> -John
>>> >>
>>> >> On Tue, Jun 8, 2010 at 10:54 AM, Earwin Burrfoot <earwin@gmail.com>
>>> wrote:
>>> >>>
>>> >>> To compute a score you have to see which of your subqueries did
not
>>> >>> match, which did, and what are the docfreqs/positions for them.
>>> >>> When iterating, and calling score() only for current doc - parts
of
>>> >>> this data (maybe even all of it, not sure) is already gathered for
>>> >>> you. If you allow calling score(int doc) - for arbitrary docId,
>>> you'll
>>> >>> have to redo this work.
>>> >>>
>>> >>> 2010/6/8 John Wang <john.wang@gmail.com>:
>>> >>> > Hi Earwin:
>>> >>> >
>>> >>> >      I am not sure I understand here, e.g. what si the difference
>>> >>> > between:
>>> >>> >
>>> >>> >      float myscorinCode(){
>>> >>> >          computeMyScore(scorer.score());
>>> >>> >      }
>>> >>> >
>>> >>> >      and
>>> >>> >
>>> >>> >       float myscorinCode(){
>>> >>> >
>>> >>> > computeMyScore(scorer.score(scorer.getDocIdSetIterator().docID());
>>> >>> >       }
>>> >>> >
>>> >>> >       In the case of BQ, when you get a hit, would you still
be
>>> able to
>>> >>> > call
>>> >>> > subscorer.score(hit)? Why is the point of iteration important
for
>>> BQ?
>>> >>> >
>>> >>> >       please elaborate.
>>> >>> >
>>> >>> > Thanks
>>> >>> >
>>> >>> > -John
>>> >>> >
>>> >>> > On Tue, Jun 8, 2010 at 10:10 AM, Earwin Burrfoot <earwin@gmail.com
>>> >
>>> >>> > wrote:
>>> >>> >>
>>> >>> >> The problem with your proposal is that, currently, Lucene
uses
>>> current
>>> >>> >> iteration state to compute score.
>>> >>> >> I.e. it already knows which of SHOULD BQ clauses matched
for
>>> current
>>> >>> >> doc, so it's easier to calculate the score.
>>> >>> >> If you change API to allow scoring arbitrary documents
(even those
>>> >>> >> that didn't match the query at all), you're opening a can
of worms
>>> :)
>>> >>> >>
>>> >>> >> As an alternative, you can try looking at MG4J sources.
As far as
>>> I
>>> >>> >> understand, their scoring is decoupled from matching, just
like
>>> you
>>> >>> >> (and I bet many more people) want. The matcher is separate,
and
>>> the
>>> >>> >> scoring entity accepts current matcher state instead of
document
>>> id,
>>> >>> >> so you get the best of both worlds.
>>> >>> >>
>>> >>> >> On Tue, Jun 8, 2010 at 21:01, John Wang <john.wang@gmail.com>
>>> wrote:
>>> >>> >> > re: But Scorer is itself an iterator, so what prevents
you from
>>> >>> >> > calling
>>> >>> >> > nextDoc and advance on it without score()
>>> >>> >> >
>>> >>> >> > Nothing. It is just inefficient to pay the method
call overhead
>>> just
>>> >>> >> > to
>>> >>> >> > overload score.
>>> >>> >> >
>>> >>> >> > re: If I were in your shoes, I'd simply provider a
Query
>>> wrapper. If
>>> >>> >> > CSQ
>>> >>> >> > is not good enough I'd just develop my own.
>>> >>> >> >
>>> >>> >> > That is what I am doing. I am just proposing the change
(see my
>>> >>> >> > first
>>> >>> >> > email)
>>> >>> >> > as an improvement.
>>> >>> >> >
>>> >>> >> > re: Scorer is itself an iterator
>>> >>> >> >
>>> >>> >> > yes, that is the current definition. The point of
the proposal
>>> is to
>>> >>> >> > make
>>> >>> >> > this change.
>>> >>> >> >
>>> >>> >> > -John
>>> >>> >> >
>>> >>> >> > On Tue, Jun 8, 2010 at 9:45 AM, Shai Erera <serera@gmail.com>
>>> wrote:
>>> >>> >> >>
>>> >>> >> >> Well ... I don't know the reason as well and always
thought
>>> Scorer
>>> >>> >> >> and
>>> >>> >> >> Similarity are confusing.
>>> >>> >> >>
>>> >>> >> >> But Scorer is itself an iterator, so what prevents
you from
>>> calling
>>> >>> >> >> nextDoc and advance on it without score(). And
what would the
>>> >>> >> >> returned
>>> >>> >> >> DISI do when nextDoc is called, if not delegate
to its subs?
>>> >>> >> >>
>>> >>> >> >> If I were in your shoes, I'd simply provider a
Query wrapper.
>>> If
>>> >>> >> >> CSQ
>>> >>> >> >> is not good enough I'd just develop my own.
>>> >>> >> >>
>>> >>> >> >> But perhaps others think differently?
>>> >>> >> >>
>>> >>> >> >> Shai
>>> >>> >> >>
>>> >>> >> >> On Tuesday, June 8, 2010, John Wang <john.wang@gmail.com>
>>> wrote:
>>> >>> >> >> > Hi Shai:
>>> >>> >> >> >     I am not sure I understand how changing
Similarity would
>>> >>> >> >> > solve
>>> >>> >> >> > this
>>> >>> >> >> > problem, wouldn't you need the reader?
>>> >>> >> >> >     As for PayloadTermQuery, payload is not
always the most
>>> >>> >> >> > efficient
>>> >>> >> >> > way of storing such data, especially when
number of terms <<
>>> >>> >> >> > numdocs.
>>> >>> >> >> > (I am
>>> >>> >> >> > not sure accessing the payload when you iterate
is a good
>>> idea,
>>> >>> >> >> > but
>>> >>> >> >> > that is
>>> >>> >> >> > another discussion)
>>> >>> >> >> >
>>> >>> >> >> >     Yes, what I described is exactly a
>>> simple CustomScoreQuery
>>> >>> >> >> > for a
>>> >>> >> >> > special use-case. The problem is also in
CustomScoreQuery,
>>> where
>>> >>> >> >> > nextDoc and
>>> >>> >> >> > advance are calling the sub-scorers as a
wrapper. This can be
>>> >>> >> >> > avoided
>>> >>> >> >> > if the
>>> >>> >> >> > Scorer returns an iterator instead.
>>> >>> >> >> >
>>> >>> >> >> >     Separating scoring and doc iteration
is a good idea
>>> anyway. I
>>> >>> >> >> > don't
>>> >>> >> >> > know the reason to combine them originally.
>>> >>> >> >> > Thanks
>>> >>> >> >> > -John
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> > On Tue, Jun 8, 2010 at 8:47 AM, Shai Erera
<serera@gmail.com
>>> >
>>> >>> >> >> > wrote:
>>> >>> >> >> >
>>> >>> >> >> > So wouldn't it make sense to add some method
to Similarity?
>>> Which
>>> >>> >> >> > receives the doc Id in question maybe ...
just thinking here.
>>> >>> >> >> >
>>> >>> >> >> > Factoring Scorer like you propose would create
3 objects for
>>> >>> >> >> > scoring/iterating: Scorer (which really becomes
an iterator),
>>> >>> >> >> > Similarity and
>>> >>> >> >> > CustomScoreFunction ...
>>> >>> >> >> >
>>> >>> >> >> > Maybe you can use CustomScoreQuery? or PayloadTermQuery?
>>> depends
>>> >>> >> >> > how
>>> >>> >> >> > you
>>> >>> >> >> > compute your age decay function (where you
pull the data
>>> about
>>> >>> >> >> > the
>>> >>> >> >> > age of
>>> >>> >> >> > the document).
>>> >>> >> >> >
>>> >>> >> >> > Shai
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> > On Tue, Jun 8, 2010 at 6:41 PM, John Wang
<
>>> john.wang@gmail.com>
>>> >>> >> >> > wrote:
>>> >>> >> >> > Hi Shai:
>>> >>> >> >> >     Similarity in many cases is not sufficient
for scoring.
>>> For
>>> >>> >> >> > example,
>>> >>> >> >> > to implement age decaying of a document (very
useful for
>>> corpuses
>>> >>> >> >> > like news
>>> >>> >> >> > or tweets), you want to project the raw tfidf
score onto a
>>> time
>>> >>> >> >> > curve, say
>>> >>> >> >> > f(x), to do this, you'd have a custom scorer
that decorates
>>> the
>>> >>> >> >> > underlying
>>> >>> >> >> > scorer from your say, boolean query:
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> > public float score(){    return myFunc(innerScorer.score());}
>>> >>> >> >> >     This is fine, but then you would have
to do this as well:
>>> >>> >> >> > public int nextDoc(){
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> >    return innerScorer.nextDoc();}
>>> >>> >> >> > and also:
>>> >>> >> >> > public int advance(int target){   return
>>> innerScorer.advance();}
>>> >>> >> >> > The difference here is that nextDoc and advance
are called
>>> far
>>> >>> >> >> > more
>>> >>> >> >> > times as
>>> >>> >> >> > score. And you are introducing an extra method
call for them,
>>> >>> >> >> > which
>>> >>> >> >> > is not
>>> >>> >> >> > insignificant for queries result in large
recall sets.
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> > Hope this makes sense.
>>> >>> >> >> > Thanks
>>> >>> >> >> > -John
>>> >>> >> >> > On Tue, Jun 8, 2010 at 5:02 AM, Shai Erera
<serera@gmail.com
>>> >
>>> >>> >> >> > wrote:
>>> >>> >> >> > I'm not sure I understand what you mean -
Scorer is a DISI
>>> >>> >> >> > itself,
>>> >>> >> >> > and
>>> >>> >> >> > the scoring formula is mostly controlled
by Similarity.
>>> >>> >> >> >
>>> >>> >> >> > What will be the benefits of the proposed
change?
>>> >>> >> >> >
>>> >>> >> >> > Shai
>>> >>> >> >> >
>>> >>> >> >> > On Tue, Jun 8, 2010 at 8:25 AM, John Wang
<
>>> john.wang@gmail.com>
>>> >>> >> >> > wrote:
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> > Hi guys:
>>> >>> >> >> >
>>> >>> >> >> >     I'd like to make a proposal to change
the Scorer
>>> class/api to
>>> >>> >> >> > the
>>> >>> >> >> > following:
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> > public abstract class Scorer{
>>> >>> >> >> >    DocIdSetIterator getDocIDSetIterator();
>>> >>> >> >> >    float score(int docid);
>>> >>> >> >> > }
>>> >>> >> >> >
>>> >>> >> >> > Reasons:
>>> >>> >> >> >
>>> >>> >> >> > 1) To build a Scorer from an existing Scorer
(e.g. that
>>> produces
>>> >>> >> >> > raw
>>> >>> >> >> > scores from tfidf), one would decorate it,
and it would
>>> introduce
>>> >>> >> >> > overhead
>>> >>> >> >> > (in function calls) around nextDoc and advance,
even if you
>>> just
>>> >>> >> >> > want
>>> >>> >> >> > to
>>> >>> >> >> > augment the score method which is called
much fewer times.
>>> >>> >> >> >
>>> >>> >> >> > 2) The current contract forces scoring on
the currentDoc in
>>> the
>>> >>> >> >> > underlying iterator. So once you pass "current",
you can no
>>> >>> >> >> > longer
>>> >>> >> >> > score. In
>>> >>> >> >> > one of our use-cases, it is very inconvenient.
>>> >>> >> >> >
>>> >>> >> >> > What do you think? I can go ahead and open
an issue and work
>>> on a
>>> >>> >> >> > patch
>>> >>> >> >> > if I get some agreement.
>>> >>> >> >> >
>>> >>> >> >> > Thanks
>>> >>> >> >> >
>>> >>> >> >> > -John
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >>
>>> ---------------------------------------------------------------------
>>> >>> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>> >> >>
>>> >>> >> >
>>> >>> >> >
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >> --
>>> >>> >> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
>>> >>> >> Phone: +7 (495) 683-567-4
>>> >>> >> ICQ: 104465785
>>> >>> >>
>>> >>> >>
>>> ---------------------------------------------------------------------
>>> >>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>> >>
>>> >>> >
>>> >>> >
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
>>> >>> Phone: +7 (495) 683-567-4
>>> >>> ICQ: 104465785
>>> >>>
>>> >>> ---------------------------------------------------------------------
>>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>>
>>> >>
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
>>> Phone: +7 (495) 683-567-4
>>> ICQ: 104465785
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>
>>
>

Mime
View raw message