incubator-lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <>
Subject Matcher
Date Thu, 29 Jan 2009 02:56:22 GMT

Many ideas have come together in a new KS class, Matcher, which I think ought
to replace Scorer in both KS and Lucy.

Conceptually, Matcher began as a simple iterator over a set of doc nums.  It
was born because KS needed an opaque super class to support iterating over
deletions -- which might be backed either by a bit vector or by what we're
calling "tombstones".

In Lucene, that's the role played by "DocIdSetIterator".  The name
"DocIdSetIterator" is precise and descriptive, but since it's a tad verbose I
stumbled around for a bit looking for an alternative... and arrived at

The idea of a "Matcher" class has been batted around many times before, but
usually in the context of "Scorer minus the scoring".  This "Matcher" would be
a more general class, usable not only for iterating over hits, but for
iterating over any set of doc nums in any context.

Around this time, Nate and I had a private email exchange on the subject of
deletions-as-iterators.  Nate:

    That said, deletions as filters seems like a good way to view things, 
    though.  I'd probably go further, and not have any special handling 
    for them in the base, then subclass Scorer_Collect to 
    Scorer_Collect_Filtered, and allow for stacked filters.

Interesting... So, if you had both a QueryFilter and some deletions, you might
merge them together into a single iterator whose output would OR together
their doc num sets.  You'd probably want to do this with a priority queue...
Hmm, that's the same technique that we'd be using for merging tombstone rows
-- thus a generic priority queue class which unions the output of multiple
Matcher iterators would kill two birds with one stone.

But wait -- we already have such a class!  ScorerDocQueue is used by ORScorer
to OR together the doc num sets of multiple Scorers.  And yet, it doesn't
handle any scoring duties itself -- the only methods it calls are Next(),
Advance() and Get_Doc_Num(), all of which are defined by Scorer's parent class
Matcher.  All we have to do to adapt it is change the queue elements from
Scorers to Matchers.  *Three* birds with one stone!

Can we pursue this still further?

The original concept for "Matcher" was a class that matches but does not
score.  Instead of a HitCollector, one thought went, we might use a
"MatchCollector" which collects only document numbers.  

The KS HitCollector's Collect() method has recently changed to address the
"match-only" case.  MatchCollector's Collect() method would theoretically have
taken only a document number, saving cycles by not requiring the Scorer to
calculate a score:

   HC_Collect(collector, doc_num, Scorer_Tally(scorer));
   /* vs. */
   MatchColl_Collect(match_collector, doc_num);

However, passing arguments is cheap.  How about, if instead of passing scoring
information we pass the Scorer itself, and leave it up to individual
HitCollectors whether to request scoring info?

   HC_Collect(collector, doc_num, scorer);

That way, classes like BitCollector can just accept the doc_num and ignore the
scorer.  Turns out tere's no need for a MatchCollector class, after all.

But then, what if we want HC_Collect() to accept a Matcher rather than require
a Scorer?  Logically, we ought to be able to, but Matcher doesn't currently
know how to supply the scoring info that some HitCollectors would need.

Well, we can fix that by moving the Scorer_Tally() method up into Matcher, and
having it supply a default score of 0.0.  Then, *any* Matcher iterating over a
set of doc ids can be used as a source of hit data.

This is another win for code reuse.  Classes which perform sophisticated
scoring powered by Similarity, such as TF/IDF-enabled TermScorer and
PhraseScorer, or compound scorers like
ANDScorer/ORScorer/RequiredOptionalScorer which call Similarity.coord(), can
subclass constant-scoring Matcher classes, e.g. ORMatcher.  However, classes
like ORMatcher can also be be used in non-scoring contexts.
In addition, we make certain kinds of subclasses simpler to write.  Nate
expressed a desire a while back to subclass the KS boolean query/scorer
hierarchy so that he could take advantage of its logical matching capabilities
yet implement custom scoring algos -- but having to cancel out all the TF/IDF
stuff imposed an annoying complexity cost.  Subclassing boolean Matcher
classes would have made that project simpler.

Now, let's assess where we've ended up.  Matcher was originally supposed to be
a simple iterator class.  It's still quite simple, but with the added methods
it looks almost identical to the old Scorer -- the only difference is the
missing Similarity member var.  That begs the question, why not just use
Scorer everywhere?  Use Scorers to iterate over deletions, Scorers for
filtering, etc.

It's possible.  But I think it makes more sense to give our search engine
library a broader mandate and re-conceptualize its search apparatus as a
matching tool which can be extended with scoring.  We can achieve that by
replacing Scorer with Matcher and by using Matcher in contexts where a
"Scorer" wouldn't have seemed appropriate.

Marvin Humphrey

View raw message