lucene-dev mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: Proposal: Scorer api change
Date Wed, 09 Jun 2010 10:42:42 GMT
OK, point taken: don't trust the JVM! I don't trust it either.

So for a TermQuery that needs to evaluate 1M docs, the delegate approach
adds 1M nextDoc calls. But for a BooleanQuery (BQ) that's not the case: each
delegated call can be followed by a series of nextDoc/advance calls by the
sub-scorers, so the overhead of the delegate approach is not fixed. It is
query-dependent, and in some cases will be very low. That was my point when
I said it's negligible.
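
To make the call counting concrete, here is a rough sketch. All names
(DocIterator, PostingsIterator, AndIterator, CountingWrapper) are made up
for illustration and are not Lucene's Scorer API; advance() is a naive
linear scan, so treat the numbers as an illustration of the ratio only.

```java
// Hypothetical names throughout; this is not Lucene's Scorer API.
interface DocIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int nextDoc();           // next matching doc, or NO_MORE_DOCS
    int advance(int target); // first matching doc >= target (target must be > current doc)
}

// Iterates a sorted doc-id list and counts how often it is called.
final class PostingsIterator implements DocIterator {
    private final int[] docs;
    private int pos = -1;
    long calls = 0;

    PostingsIterator(int... docs) { this.docs = docs; }

    public int nextDoc() {
        calls++;
        pos++;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    public int advance(int target) {
        calls++;
        do { pos++; } while (pos < docs.length && docs[pos] < target);
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }
}

// Conjunction of two iterators: leapfrogs until both sit on the same doc.
final class AndIterator implements DocIterator {
    private final DocIterator a, b;
    private int docB = -1;

    AndIterator(DocIterator a, DocIterator b) { this.a = a; this.b = b; }

    public int nextDoc() { return align(a.nextDoc()); }
    public int advance(int target) { return align(a.advance(target)); }

    private int align(int docA) {
        while (docA != NO_MORE_DOCS) {
            if (docB < docA) docB = b.advance(docA);
            if (docB == docA) return docA;
            if (docB == NO_MORE_DOCS) break;
            docA = a.advance(docB); // here docB > docA
        }
        return NO_MORE_DOCS;
    }
}

// The "delegate": forwards every call and counts the extra hop.
final class CountingWrapper implements DocIterator {
    private final DocIterator in;
    long calls = 0;

    CountingWrapper(DocIterator in) { this.in = in; }

    public int nextDoc() { calls++; return in.nextDoc(); }
    public int advance(int target) { calls++; return in.advance(target); }
}

final class DelegateOverheadDemo {
    // Exhausts the iterator and returns the number of matching docs.
    static long drain(DocIterator it) {
        long matches = 0;
        while (it.nextDoc() != DocIterator.NO_MORE_DOCS) matches++;
        return matches;
    }
}
```

Wrapping a single five-doc PostingsIterator and draining it costs six
wrapper calls for six inner calls (one extra hop per nextDoc). Wrapping an
AndIterator over {1,3,5,7,9,11} and {2,4,6,8,10,11} costs two wrapper calls
while the sub-iterators do thirteen calls between them.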

And I do think we should move to a Matcher/Scorer model since, as Mike
says, matching docs has nothing to do with scoring. The matching part is
boolean (either the doc matches the query or it doesn't), so there is less
room for custom code here. The scoring part is the logic that usually gets
customized. In fact, when I moved to Lucene I looked for that logic in the
code, and was confused that Scorer is also the matcher.
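
Roughly what I have in mind, as a sketch only: Matcher, ScoreFunction,
TermMatcher and MatchThenScore below are hypothetical names, not existing
Lucene classes, and a real design would need to expose more matcher state
than a doc id and a frequency.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interfaces for illustration; not Lucene's API.
interface Matcher {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int nextDoc(); // next matching doc id, or NO_MORE_DOCS
    int freq();    // term frequency in the current doc
}

// The customizable part: turns matcher state into a score.
interface ScoreFunction {
    float score(int doc, int freq);
}

// Pure matching over parallel doc/freq arrays; knows nothing about scoring.
final class TermMatcher implements Matcher {
    private final int[] docs;
    private final int[] freqs;
    private int pos = -1;

    TermMatcher(int[] docs, int[] freqs) {
        this.docs = docs;
        this.freqs = freqs;
    }

    public int nextDoc() {
        pos++;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    public int freq() { return freqs[pos]; }
}

final class MatchThenScore {
    // Drives the matcher and applies whichever scoring logic was plugged in.
    static List<Float> run(Matcher matcher, ScoreFunction scorer) {
        List<Float> scores = new ArrayList<>();
        for (int doc = matcher.nextDoc(); doc != Matcher.NO_MORE_DOCS; doc = matcher.nextDoc()) {
            scores.add(scorer.score(doc, matcher.freq()));
        }
        return scores;
    }
}
```

The same TermMatcher can then be run with (doc, freq) -> (float)
Math.sqrt(freq) or with (doc, freq) -> 1.0f, without touching the matching
code.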

I think we've discussed this once, decoupling Weight/Scorer from Query;
that was related to the Collector API changes. I don't remember the outcome,
but perhaps we should discuss it again: can (should) we decouple Query from
Scorer/Weight, such that one can provide their own Scorer implementation to
a Query without knowing the internals of the Query? Can we represent the
Query state in some general structure so that, no matter which Query you
get, you'll know how to score it?

Shai

On Wed, Jun 9, 2010 at 1:02 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> I generally don't trust the compiler, if/when I have that freedom.
>
> If you can fix a hotspot in Lucene to avoid an extra method call, an
> extra add/multiply, etc., you should.  Doing so ensures the cost can't
> be there.  Not doing so means you rely on the JRE to be smart enough,
> and it very easily may not be (there are so many variables), and that
> also makes Lucene's performance more fragile/env-specific.
>
> Why take that chance?
>
> I also don't rely on benchmarks to validate this on a case-by-case
> basis; the cost for any single change (like this one) can easily be in
> the noise, yet these micro-costs do add up.
>
> Different rules apply "down low".  It's like quantum physics!
>
> I think, besides avoiding method calls, there are compelling reasons
> to consider a stronger decoupling of matching & scoring.  A Query
> really ought to be two separable things -- matching (like Filter) and
> scoring.
>
> EG DisjunctionMaxQuery has its own matching code that duplicates what
> BooleanQuery does if the query is all SHOULD clauses.  Why duplicate
> this code?  Why restrict the "max score of all subs = doc's score" to
> only SHOULD-only BooleanQueries?  If we had full matching/scoring
> decoupling, we wouldn't have to.
>
> Or, eg the BM25 patch (LUCENE-2091) had to create its own
> BM25BooleanQuery to do matching & scoring, which is silly -- if it's
> only changing how scoring works, it should be able to reuse the
> existing matching code in BooleanQuery.
>
> That said, there are challenges; eg the higher performance
> BooleanScorer (which scores docs in "chunks" and is free to collect
> them out-of-order) would be challenging to fully decouple from scoring
> since it's not strictly "doc-at-once".
>
> On the other part of the proposal (allowing .score() to take an
> arbitrary docID), that does sound like a can of worms.  MG4J's model
> (scorer receives the full "state" of the matcher and can peek in as
> necessary) sounds compelling...
>
> Mike
>
> On Wed, Jun 9, 2010 at 3:35 AM, Earwin Burrfoot <earwin@gmail.com> wrote:
> > Lies, lies, lies :)
> > I mean, the Sun JIT is over-relied on, especially with regard to inlining.
> >
> > But there are some cases when you can trust it. E.g., if you call a
> > virtual method and this exact call site gets refs to different objects
> > at runtime (meaning here: you wrap different Queries in your
> > WrapperQuery), you can definitely rely on the call not being inlined.
> >
> > So, I agree with John on his /rough/ overhead estimates, on the part
> > that it exists, and it's detectable. I don't agree on allowing
> > arbitrary doc scoring. People who really need this for some strange
> > applications can emulate it now by advance()-ing the scorer to the needed
> > doc and calling score(). But for most people it's unnecessary, and as
> > I said, it will lead to scaaary code.
> >
> > If you really think that one or two method calls in a loop are
> > negligible, I ask you to join my holy crusade and erase the
> > Scorer.score(Collector) set of methods :) They exist for the
> > sole purpose of cutting out a few method calls, and are really,
> > really, really confusing.
> >
> >
> > 2010/6/9 Shai Erera <serera@gmail.com>:
> >> I don't think the method call is an overhead, John. You don't need to
> >> reiterate it. The compiler does make optimizations and inlines such
> >> code/calls if it can. More than that, query processing involves so
> >> many method calls that I do think that's insignificant.
> >
> > Woohoo! Mexican standoff! :)
> >
> > --
> > Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
> > Phone: +7 (495) 683-567-4
> > ICQ: 104465785
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
> >
