lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject [lucy-dev] MatchEngine
Date Thu, 07 Apr 2011 23:35:29 GMT
Greets,

Case 1:

    * Some time ago, Nate Kurz proposed a test for the pluggability of the
      matching engine in Lucy: we should be able to support a Searcher which
      wraps an SQLite index.

Case 2:

    * A few weeks ago, there was interest on the user list in finding out
      which field had matched a Query for a given search result -- information
      which our current matching system does not preserve.

In both cases, to implement the desired behavior it is necessary to start by
writing new low-level Matcher classes which interact with actual hard-core
data.  For SQLite, you need Matchers which find matching documents using SQL
commands; for field-match data, you need Matchers which read the same index
formats as Lucy's current Matchers, but which preserve per-hit match criteria
instead of just calculating a score.

However, because of a quirk in the way we compile Queries to Matchers now --
what Matcher you get is 100% determined by what Query you start with -- you
would also need to reimplement a number of high-level Query classes.  This is
the problem I hope to solve with MatchEngine.

Currently, Queries are factories for Compilers and Compilers are factories for
Matchers:

    my $compiler = $query->make_compiler(searcher => $searcher);
    for my $seg_reader (@seg_readers) {
        my $matcher = $compiler->make_matcher(reader => $seg_reader);
        ...
    }

If we eliminate the intermediate stage of Compiler and instead modify Query
objects in place during the weighting stage, the relationship between Query and
Matcher becomes clearer, but we still have the same problem:

    $query->weight(searcher => $searcher);
    for my $seg_reader (@seg_readers) {
        my $matcher = $query->make_matcher(reader => $seg_reader);
        ...
    }

It seems to me that you ought to be able to create a search specification in
the form of a Query object using standard Lucy::Search::*Query classes, and
then feed that Query into *any* Searcher -- regardless of what low-level data
that Searcher wraps or how it interacts with that data.

    # Reuse a Query object with multiple searchers.
    my $standard_searcher = Lucy::Search::IndexSearcher->new(index => $index);
    my $sqlite_searcher   = LucyX::Search::SQLiteSearcher->new(db => $db);
    my $query       = $query_parser->parse("foo bar");
    my $tfidf_hits  = $standard_searcher->hits(query => $query);
    my $sqlite_hits = $sqlite_searcher->hits(query => $query);

To solve this problem, I propose that we sever the direct relationship between
Lucy's core, public-facing Query classes and our matching model.

    * Eliminate Compiler.
    * Eliminate Query_Make_Compiler().
    * Introduce abstract method Query_Make_Matcher(), which none of the core
      Query classes would implement.
    * Introduce MatchEngine.
      o Each Searcher has-a MatchEngine. 
      o MatchEngine transforms high-level Query objects into Query objects
        tied to a matching model.
      o The default MatchEngine would be TFIDFMatchEngine.
  
Essentially, we would be swapping out Query_Make_Compiler() for
MatchEngine_Prepare():

     my $query = $query_parser->parse("foo bar");
-    my $compiler = $query->make_compiler(searcher => $searcher);
+    my $prepared_query = $match_engine->prepare(
+        searcher => $searcher,
+        query    => $query,
+    );
     for my $seg_reader (@seg_readers) {
-        my $matcher = $compiler->make_matcher(reader => $seg_reader);
+        my $matcher = $prepared_query->make_matcher(reader => $seg_reader);
         ...
     }
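
As a rough illustration of the prepare() step in the diff above, here is a
minimal, self-contained sketch in plain Perl.  Everything in the Sketch::
namespace is hypothetical and exists only for this example: none of it is part
of Lucy, the weight is a placeholder rather than a real TF-IDF calculation,
and the "matcher" is just a hash standing in for a real Matcher object.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for illustration only; not part of Lucy.

package Sketch::TermQuery;          # high-level query: a plain container
sub new {
    my ($class, %args) = @_;
    return bless { field => $args{field}, term => $args{term} }, $class;
}

package Sketch::TFIDFTermQuery;     # prepared query: tied to a matching model
sub new {
    my ($class, %args) = @_;
    return bless { parent => $args{parent}, weight => $args{weight} }, $class;
}
sub make_matcher {
    my ($self, %args) = @_;
    # A real Matcher would iterate over postings in the segment reader.
    # Here we return a hash that records what it would have been built from.
    return {
        reader => $args{reader},
        term   => $self->{parent}{term},
        weight => $self->{weight},
    };
}

package Sketch::TFIDFMatchEngine;   # turns high-level queries into prepared ones
sub new {
    my $class = shift;
    return bless {}, $class;
}
sub prepare {
    my ($self, %args) = @_;
    my $query = $args{query};
    die "Unsupported query class: " . ref($query) . "\n"
        unless $query->isa('Sketch::TermQuery');
    # Placeholder weight; real weighting would consult $args{searcher}.
    return Sketch::TFIDFTermQuery->new(parent => $query, weight => 1.0);
}

package main;

my $query          = Sketch::TermQuery->new(field => 'content', term => 'foo');
my $match_engine   = Sketch::TFIDFMatchEngine->new;
my $prepared_query = $match_engine->prepare(
    searcher => undef,              # no real Searcher in this sketch
    query    => $query,
);

# One matcher per (pretend) segment reader.
my @matchers = map { $prepared_query->make_matcher(reader => $_) }
    qw(seg_1 seg_2);
```

The point of the sketch is that Sketch::TermQuery knows nothing about
weighting; a different engine could accept the same object and hand back a
prepared query of a completely different class.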

This scheme makes it slightly easier to write something like an SQLiteSearcher
-- but more importantly, it makes it significantly easier for downstream users
to deploy and use an SQLiteSearcher, because you can continue to use standard
Query and QueryParser objects with it.   

On some level, the elimination of Compiler in the plan described above is
sleight-of-hand.  Instead of having two distinct classes (Query and Compiler),
we now have two grades of Query classes: those that implement Make_Matcher()
-- tying them to a matching model -- and those that don't.
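
To make the "two grades" distinction concrete, here is a small plain-Perl
sketch.  The Sketch:: classes are hypothetical and only model the idea: the
base class carries an abstract make_matcher() which throws, a core-style
query inherits it untouched, and a model-specific query overrides it.  The
string returned by the override is a placeholder for a real Matcher.

```perl
use strict;
use warnings;

# Hypothetical classes for illustration only; not part of Lucy.

package Sketch::Query;
sub new {
    my $class = shift;
    return bless {}, $class;
}
sub make_matcher {                  # abstract: subclasses may override
    my $self = shift;
    die "Abstract method 'make_matcher' not implemented by "
        . ref($self) . "\n";
}

package Sketch::PhraseQuery;        # core grade: inherits the abstract method
our @ISA = ('Sketch::Query');

package Sketch::TFIDFPhraseQuery;   # prepared grade: tied to a matching model
our @ISA = ('Sketch::Query');
sub make_matcher { return 'placeholder matcher' }

package main;

my $core     = Sketch::PhraseQuery->new;
my $prepared = Sketch::TFIDFPhraseQuery->new;

# Calling make_matcher() on the core grade throws; eval traps the error.
my $core_ok     = eval { $core->make_matcher; 1 } ? 1 : 0;
my $core_error  = $@;
my $prepared_ok = eval { $prepared->make_matcher; 1 } ? 1 : 0;

printf "%s can make a matcher: %d\n", ref($core),     $core_ok;
printf "%s can make a matcher: %d\n", ref($prepared), $prepared_ok;
```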

This TFIDF stuff has to go *somewhere*, though.  I feel strongly that it does
not belong in either our core Query classes or our core Searcher classes.  If
we don't quarantine TFIDF away from Query and Searcher, implementing an
alternative matching model becomes more complicated: you have all this TFIDF
cruft cluttering up the classes you need to subclass.

If we were to implement this "MatchEngine" proposal as is, TermCompiler,
PhraseCompiler, etc. would probably move out of TermQuery.c, PhraseQuery.c,
etc., and into dedicated class files -- perhaps into a new sub-hierarchy such
as "Lucy::Score":

    * Lucy::Score::TFIDFTermQuery
    * Lucy::Score::TFIDFPhraseQuery
    * Lucy::Score::TFIDFRangeQuery

Personally, I'm not sure it's an improvement to eliminate Compiler as a class
and start differentiating between Query classes based on whether they
implement Make_Matcher().  However, I felt it was important to at least
explore what eliminating Compiler would look like so that we can move the
discussion forward. :)

In any case, it will be nice to get TFIDF out of core/Lucy/Search/*Query.c --
once we exile TFIDF and the weighting/normalizing logic that goes with it,
those basic *Query classes become dead simple containers.

Marvin Humphrey


