lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Earwin Burrfoot <ear...@gmail.com>
Subject Re: Proposal: Scorer api change
Date Tue, 08 Jun 2010 22:50:52 GMT
Some people don't do IO while searching at all. When you're over
certain qps/index size threshold, you need less nodes to keep all your
index (or its hot parts) in memory, than to keep combined IO subsystem
throughput high enough to satisfy disc-based search demands.

2010/6/9 Doron Cohen <cdoronc@gmail.com>:
> I too tend to ignore the overhead of delegated calls, especially comparing
> to all other IO ops and computations done by the stack of scorers, but
> accepting that you cannot ignore it, could you achieve the same goal by
> sub-classing the top query where you subclass its weight to return a
> sub-class of its scorer which would only override score() but not the other
> methods, and in score would apply that eg decay logic? This way no
> delegation is required for the other methods. A disadvantage of this is that
> you would need subclass like this any kind of top level query that might
> come up in your app - so not sure if this is really acceptable in your case.
> Another disadvantage is that this is a much more complicated code to write.
>
> Doron
>
> 2010/6/8 John Wang <john.wang@gmail.com>
>>
>> Wouldn't you get it as well with proposed api?
>> You would still be able to iterate the doc and at that point call score
>> with the docid. If you call score() along with iteration, you would still
>> get the information no?
>> Making scorer take a docid allows you score any docid in the reader if the
>> query wants it to. Wouldn't it make it more flexible?
>> -John
>>
>> On Tue, Jun 8, 2010 at 10:54 AM, Earwin Burrfoot <earwin@gmail.com> wrote:
>>>
>>> To compute a score you have to see which of your subqueries did not
>>> match, which did, and what are the docfreqs/positions for them.
>>> When iterating, and calling score() only for current doc - parts of
>>> this data (maybe even all of it, not sure) is already gathered for
>>> you. If you allow calling score(int doc) - for arbitrary docId, you'll
>>> have to redo this work.
>>>
>>> 2010/6/8 John Wang <john.wang@gmail.com>:
>>> > Hi Earwin:
>>> >
>>> >      I am not sure I understand here, e.g. what si the difference
>>> > between:
>>> >
>>> >      float myscorinCode(){
>>> >          computeMyScore(scorer.score());
>>> >      }
>>> >
>>> >      and
>>> >
>>> >       float myscorinCode(){
>>> >
>>> > computeMyScore(scorer.score(scorer.getDocIdSetIterator().docID());
>>> >       }
>>> >
>>> >       In the case of BQ, when you get a hit, would you still be able
to
>>> > call
>>> > subscorer.score(hit)? Why is the point of iteration important for BQ?
>>> >
>>> >       please elaborate.
>>> >
>>> > Thanks
>>> >
>>> > -John
>>> >
>>> > On Tue, Jun 8, 2010 at 10:10 AM, Earwin Burrfoot <earwin@gmail.com>
>>> > wrote:
>>> >>
>>> >> The problem with your proposal is that, currently, Lucene uses current
>>> >> iteration state to compute score.
>>> >> I.e. it already knows which of SHOULD BQ clauses matched for current
>>> >> doc, so it's easier to calculate the score.
>>> >> If you change API to allow scoring arbitrary documents (even those
>>> >> that didn't match the query at all), you're opening a can of worms :)
>>> >>
>>> >> As an alternative, you can try looking at MG4J sources. As far as I
>>> >> understand, their scoring is decoupled from matching, just like you
>>> >> (and I bet many more people) want. The matcher is separate, and the
>>> >> scoring entity accepts current matcher state instead of document id,
>>> >> so you get the best of both worlds.
>>> >>
>>> >> On Tue, Jun 8, 2010 at 21:01, John Wang <john.wang@gmail.com>
wrote:
>>> >> > re: But Scorer is itself an iterator, so what prevents you from
>>> >> > calling
>>> >> > nextDoc and advance on it without score()
>>> >> >
>>> >> > Nothing. It is just inefficient to pay the method call overhead
just
>>> >> > to
>>> >> > overload score.
>>> >> >
>>> >> > re: If I were in your shoes, I'd simply provider a Query wrapper.
If
>>> >> > CSQ
>>> >> > is not good enough I'd just develop my own.
>>> >> >
>>> >> > That is what I am doing. I am just proposing the change (see my
>>> >> > first
>>> >> > email)
>>> >> > as an improvement.
>>> >> >
>>> >> > re: Scorer is itself an iterator
>>> >> >
>>> >> > yes, that is the current definition. The point of the proposal
is to
>>> >> > make
>>> >> > this change.
>>> >> >
>>> >> > -John
>>> >> >
>>> >> > On Tue, Jun 8, 2010 at 9:45 AM, Shai Erera <serera@gmail.com>
wrote:
>>> >> >>
>>> >> >> Well … I don't know the reason as well and always thought
Scorer
>>> >> >> and
>>> >> >> Similarity are confusing.
>>> >> >>
>>> >> >> But Scorer is itself an iterator, so what prevents you from
calling
>>> >> >> nextDoc and advance on it without score(). And what would the
>>> >> >> returned
>>> >> >> DISI do when nextDoc is called, if not delegate to its subs?
>>> >> >>
>>> >> >> If I were in your shoes, I'd simply provider a Query wrapper.
If
>>> >> >> CSQ
>>> >> >> is not good enough I'd just develop my own.
>>> >> >>
>>> >> >> But perhaps others think differently?
>>> >> >>
>>> >> >> Shai
>>> >> >>
>>> >> >> On Tuesday, June 8, 2010, John Wang <john.wang@gmail.com>
wrote:
>>> >> >> > Hi Shai:
>>> >> >> >     I am not sure I understand how changing Similarity
would
>>> >> >> > solve
>>> >> >> > this
>>> >> >> > problem, wouldn't you need the reader?
>>> >> >> >     As for PayloadTermQuery, payload is not always
the most
>>> >> >> > efficient
>>> >> >> > way of storing such data, especially when number of terms
<<
>>> >> >> > numdocs.
>>> >> >> > (I am
>>> >> >> > not sure accessing the payload when you iterate is a good
idea,
>>> >> >> > but
>>> >> >> > that is
>>> >> >> > another discussion)
>>> >> >> >
>>> >> >> >     Yes, what I described is exactly a simple CustomScoreQuery
>>> >> >> > for a
>>> >> >> > special use-case. The problem is also in CustomScoreQuery,
where
>>> >> >> > nextDoc and
>>> >> >> > advance are calling the sub-scorers as a wrapper. This
can be
>>> >> >> > avoided
>>> >> >> > if the
>>> >> >> > Scorer returns an iterator instead.
>>> >> >> >
>>> >> >> >     Separating scoring and doc iteration is a good
idea anyway. I
>>> >> >> > don't
>>> >> >> > know the reason to combine them originally.
>>> >> >> > Thanks
>>> >> >> > -John
>>> >> >> >
>>> >> >> >
>>> >> >> > On Tue, Jun 8, 2010 at 8:47 AM, Shai Erera <serera@gmail.com>
>>> >> >> > wrote:
>>> >> >> >
>>> >> >> > So wouldn't it make sense to add some method to Similarity?
Which
>>> >> >> > receives the doc Id in question maybe ... just thinking
here.
>>> >> >> >
>>> >> >> > Factoring Scorer like you propose would create 3 objects
for
>>> >> >> > scoring/iterating: Scorer (which really becomes an iterator),
>>> >> >> > Similarity and
>>> >> >> > CustomScoreFunction ...
>>> >> >> >
>>> >> >> > Maybe you can use CustomScoreQuery? or PayloadTermQuery?
depends
>>> >> >> > how
>>> >> >> > you
>>> >> >> > compute your age decay function (where you pull the data
about
>>> >> >> > the
>>> >> >> > age of
>>> >> >> > the document).
>>> >> >> >
>>> >> >> > Shai
>>> >> >> >
>>> >> >> >
>>> >> >> > On Tue, Jun 8, 2010 at 6:41 PM, John Wang <john.wang@gmail.com>
>>> >> >> > wrote:
>>> >> >> > Hi Shai:
>>> >> >> >     Similarity in many cases is not sufficient for
scoring. For
>>> >> >> > example,
>>> >> >> > to implement age decaying of a document (very useful for
corpuses
>>> >> >> > like news
>>> >> >> > or tweets), you want to project the raw tfidf score onto
a time
>>> >> >> > curve, say
>>> >> >> > f(x), to do this, you'd have a custom scorer that decorates
the
>>> >> >> > underlying
>>> >> >> > scorer from your say, boolean query:
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > public float score(){    return myFunc(innerScorer.score());}
>>> >> >> >     This is fine, but then you would have to do this
as well:
>>> >> >> > public int nextDoc(){
>>> >> >> >
>>> >> >> >
>>> >> >> >    return innerScorer.nextDoc();}
>>> >> >> > and also:
>>> >> >> > public int advance(int target){   return innerScorer.advance();}
>>> >> >> > The difference here is that nextDoc and advance are called
far
>>> >> >> > more
>>> >> >> > times as
>>> >> >> > score. And you are introducing an extra method call for
them,
>>> >> >> > which
>>> >> >> > is not
>>> >> >> > insignificant for queries result in large recall sets.
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > Hope this makes sense.
>>> >> >> > Thanks
>>> >> >> > -John
>>> >> >> > On Tue, Jun 8, 2010 at 5:02 AM, Shai Erera <serera@gmail.com>
>>> >> >> > wrote:
>>> >> >> > I'm not sure I understand what you mean - Scorer is a
DISI
>>> >> >> > itself,
>>> >> >> > and
>>> >> >> > the scoring formula is mostly controlled by Similarity.
>>> >> >> >
>>> >> >> > What will be the benefits of the proposed change?
>>> >> >> >
>>> >> >> > Shai
>>> >> >> >
>>> >> >> > On Tue, Jun 8, 2010 at 8:25 AM, John Wang <john.wang@gmail.com>
>>> >> >> > wrote:
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > Hi guys:
>>> >> >> >
>>> >> >> >     I'd like to make a proposal to change the Scorer
class/api to
>>> >> >> > the
>>> >> >> > following:
>>> >> >> >
>>> >> >> >
>>> >> >> > public abstract class Scorer{
>>> >> >> >    DocIdSetIterator getDocIDSetIterator();
>>> >> >> >    float score(int docid);
>>> >> >> > }
>>> >> >> >
>>> >> >> > Reasons:
>>> >> >> >
>>> >> >> > 1) To build a Scorer from an existing Scorer (e.g. that
produces
>>> >> >> > raw
>>> >> >> > scores from tfidf), one would decorate it, and it would
introduce
>>> >> >> > overhead
>>> >> >> > (in function calls) around nextDoc and advance, even if
you just
>>> >> >> > want
>>> >> >> > to
>>> >> >> > augment the score method which is called much fewer times.
>>> >> >> >
>>> >> >> > 2) The current contract forces scoring on the currentDoc
in the
>>> >> >> > underlying iterator. So once you pass "current", you can
no
>>> >> >> > longer
>>> >> >> > score. In
>>> >> >> > one of our use-cases, it is very inconvenient.
>>> >> >> >
>>> >> >> > What do you think? I can go ahead and open an issue and
work on a
>>> >> >> > patch
>>> >> >> > if I get some agreement.
>>> >> >> >
>>> >> >> > Thanks
>>> >> >> >
>>> >> >> > -John
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >>
>>> >> >>
>>> >> >> ---------------------------------------------------------------------
>>> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >> >>
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
>>> >> Phone: +7 (495) 683-567-4
>>> >> ICQ: 104465785
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>>> >>
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
>>> Phone: +7 (495) 683-567-4
>>> ICQ: 104465785
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message