incubator-lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: Matcher
Date Fri, 06 Feb 2009 10:55:39 GMT
I like this approach Marvin, and I appreciate the meandering path you
went through to arrive at it!

I very much like putting the invocation of Matcher.score()
(Scorer_Tally()) into the collector's hands.  In Lucene we always
score prior to collection, but eg if your app will sort by something
other than relevance, and does not intend to present relevance, then
you don't need to compute the score.

One thing I'm struggling with now (in Lucene), that I'm not sure
applies to KS/Lucy, is how to best take "random access Matchers" into
account.  Not all Matchers are created equal.  Some really want to do
the next()/skipTo() iteration (eg a posting list), but others (deleted
docs currently, or cached filters, in Lucene) are "best" accessed by
random access accept(int docID) API, perhaps permuted down to all leaf
TermScorers, though they could also do iteration.  I'm seeing sizable
performance gains by "taking advantage" of Matchers that could expose
a random access API.  It would be good if we could query a Matcher to
see if it supports random access...

Another thing you might want to add to Matcher is "approxCount()",
which would make a rough guess at the possible number of matches, to
help a future "matching optimizer" pick a strategy.


Marvin Humphrey wrote:

> Greets,
> Many ideas have come together in a new KS class, Matcher, which I  
> think ought
> to replace Scorer in both KS and Lucy.
> Conceptually, Matcher began as a simple iterator over a set of doc  
> nums.  It
> was born because KS needed an opaque super class to support  
> iterating over
> deletions -- which might be backed either by a bit vector or by what  
> we're
> calling "tombstones".
> In Lucene, that's the role played by "DocIdSetIterator".  The name
> "DocIdSetIterator" is precise and descriptive, but since it's a tad  
> verbose I
> stumbled around for a bit looking for an alternative... and arrived at
> "Matcher".
> The idea of a "Matcher" class has been batted around many times  
> before, but
> usually in the context of "Scorer minus the scoring".  This  
> "Matcher" would be
> a more general class, usable not only for iterating over hits, but for
> iterating over any set of doc nums in any context.
> Around this time, Nate and I had a private email exchange on the  
> subject of
> deletions-as-iterators.  Nate:
>    That said, deletions as filters seems like a good way to view  
> things,
>    though.  I'd probably go further, and not have any special handling
>    for them in the base, then subclass Scorer_Collect to
>    Scorer_Collect_Filtered, and allow for stacked filters.
> Interesting... So, if you had both a QueryFilter and some deletions,  
> you might
> merge them together into a single iterator whose output would OR  
> together
> their doc num sets.  You'd probably want to do this with a priority  
> queue...
> Hmm, that's the same technique that we'd be using for merging  
> tombstone rows
> -- thus a generic priority queue class which unions the output of  
> multiple
> Matcher iterators would kill two birds with one stone.
> But wait -- we already have such a class!  ScorerDocQueue is used by  
> ORScorer
> to OR together the doc num sets of multiple Scorers.  And yet, it  
> doesn't
> handle any scoring duties itself -- the only methods it calls are  
> Next(),
> Advance() and Get_Doc_Num(), all of which are defined by Scorer's  
> parent class
> Matcher.  All we have to do to adapt it is change the queue elements  
> from
> Scorers to Matchers.  *Three* birds with one stone!
> Can we pursue this still further?
> The original concept for "Matcher" was a class that matches but does  
> not
> score.  Instead of a HitCollector, one thought went, we might use a
> "MatchCollector" which collects only document numbers.
> The KS HitCollector's Collect() method has recently changed to  
> address the
> "match-only" case.  MatchCollector's Collect() method would  
> theoretically have
> taken only a document number, saving cycles by not requiring the  
> Scorer to
> calculate a score:
>   HC_Collect(collector, doc_num, Scorer_Tally(scorer));
>   /* vs. */
>   MatchColl_Collect(match_collector, doc_num);
> However, passing arguments is cheap.  How about, if instead of  
> passing scoring
> information we pass the Scorer itself, and leave it up to individual
> HitCollectors whether to request scoring info?
>   HC_Collect(collector, doc_num, scorer);
> That way, classes like BitCollector can just accept the doc_num and  
> ignore the
> scorer.  Turns out tere's no need for a MatchCollector class, after  
> all.
> But then, what if we want HC_Collect() to accept a Matcher rather  
> than require
> a Scorer?  Logically, we ought to be able to, but Matcher doesn't  
> currently
> know how to supply the scoring info that some HitCollectors would  
> need.
> Well, we can fix that by moving the Scorer_Tally() method up into  
> Matcher, and
> having it supply a default score of 0.0.  Then, *any* Matcher  
> iterating over a
> set of doc ids can be used as a source of hit data.
> This is another win for code reuse.  Classes which perform  
> sophisticated
> scoring powered by Similarity, such as TF/IDF-enabled TermScorer and
> PhraseScorer, or compound scorers like
> ANDScorer/ORScorer/RequiredOptionalScorer which call  
> Similarity.coord(), can
> subclass constant-scoring Matcher classes, e.g. ORMatcher.  However,  
> classes
> like ORMatcher can also be be used in non-scoring contexts.
> In addition, we make certain kinds of subclasses simpler to write.   
> Nate
> expressed a desire a while back to subclass the KS boolean query/ 
> scorer
> hierarchy so that he could take advantage of its logical matching  
> capabilities
> yet implement custom scoring algos -- but having to cancel out all  
> the TF/IDF
> stuff imposed an annoying complexity cost.  Subclassing boolean  
> Matcher
> classes would have made that project simpler.
> Now, let's assess where we've ended up.  Matcher was originally  
> supposed to be
> a simple iterator class.  It's still quite simple, but with the  
> added methods
> it looks almost identical to the old Scorer -- the only difference  
> is the
> missing Similarity member var.  That begs the question, why not just  
> use
> Scorer everywhere?  Use Scorers to iterate over deletions, Scorers for
> filtering, etc.
> It's possible.  But I think it makes more sense to give our search  
> engine
> library a broader mandate and re-conceptualize its search apparatus  
> as a
> matching tool which can be extended with scoring.  We can achieve  
> that by
> replacing Scorer with Matcher and by using Matcher in contexts where a
> "Scorer" wouldn't have seemed appropriate.
> Marvin Humphrey

View raw message