From Marvin Humphrey <mar...@rectangular.com>
Subject [lucy-dev] Refining Query-to-Matcher compilation
Date Thu, 07 Apr 2011 23:29:33 GMT
On Mon, Apr 04, 2011 at 12:44:41PM -0700, Nathan Kurz wrote:
> So I'll focus on a single (contentious) point:  As an object, Compiler can
> just go away.  It's a silly optimization for TF/IDF scoring and thus
> shouldn't be a primary architectural feature.  It's just a Query with the
> boosts filled in.

So long as we're going to support TF/IDF, its complexity can only be hidden,
not eliminated.  Many alternative weighting and matching schemes (BM25, TF/ICF,
LSA, etc.) also require corpus-wide statistics.

Unfortunately, Query's "boost" member variable -- which is there to support
user-specified boosting -- isn't the only one we need.  For example,
TermCompiler currently adds these four member variables:

    float idf;
    float raw_weight;
    float query_norm_factor;
    float normalized_weight;

Propagating weighting information through a nested query structure also
involves several methods.  Right now, the list includes
Sum_Of_Squared_Weights(), Normalize(), Apply_Norm_Factor(), and so on -- all of
which are TF/IDF-specific.  However, we don't care so much about what those
methods do as we care about supporting weighting systems which can be
spread out over an arbitrary number of methods.
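
To make that concrete, here's a rough sketch of how a Lucene-style TF/IDF
pass might thread through those methods.  The 1/sqrt() formula and the
surrounding glue are illustrative assumptions, not a transcript of our
internals:

    /* Illustrative only: a Lucene-style query-norm pass over a compiled
     * query.  The formula and calling sequence are assumptions made for
     * the sake of the example; sqrtf() comes from <math.h>. */
    float sum_of_squares = Compiler_Sum_Of_Squared_Weights(compiler);
    float norm_factor    = sum_of_squares > 0.0f
                         ? 1.0f / sqrtf(sum_of_squares)
                         : 1.0f;
    Compiler_Apply_Norm_Factor(compiler, norm_factor); /* recurses into children */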

When weighting an arbitrarily complex query, we have to allow the scoring
model the option of having member variables and methods which perform the
weighting, and we have to allow for the possibility that weighting will
proceed in an arbitrary number of stages, requiring gradual modifications to
complex internal state before collapsing down to a final "weight" -- if it
ever does.

Because we can hide all the methods and members that implement this
complexity away in Compiler and its subclasses, we don't have to clutter up
Query.

I think we can do a better job of quarantining the complexity away from our 
core Query classes, though.

> > However, I do not wish to pursue MatchEngine right this moment.
> 
> OK, that's off limits for now.

I'm happy to discuss MatchEngine or other possibilities, and I will be firing
off an email on MatchEngine shortly. It's just that we shouldn't block our
0.1.0-incubating release on getting MatchEngine sorted.

> > A Query is a simple container which contains the minimum information necessary
> > to define an abstract search query.  It is not specific to any index or
> > collection of documents.
> 
> Mostly.  How about "is not tied to any specific index"?   

That's an improvement... we're getting there...

> And why "simple" and "minimal"?

How does this look?

  A Query is a specification which defines a search.

  High-level Query classes are typically not tied to any index, corpus,
  schema, or scoring model. 

  Lower-level Query classes may incorporate arbitrarily specific criteria,
  such as corpus statistical data or weighting calculations peculiar to a
  given scoring model.

> > A Query object cannot perform highlighting because it does not have access
> > to weighting information and thus cannot calculate the score for each
> > snippet.
> 
> That sounds easy to fix:  we've got a boost field, let's use it!   If more
> than a floating point boost field is needed, let's add those fields to
> Query.

I would prefer not to go down this path, for two reasons.

First, as described above, "boost" alone isn't enough -- and I don't think it's
a good idea to pollute our core Query classes with TF/IDF-specific member
variables and methods.

Second, under the current system, our high-level Query objects are reusable.
If we were to embed per-corpus weighting information within them, though, they
would become one-shot.  That increases complexity for the end user, because it
becomes important to always create new Query objects -- and if you aren't
fastidious about following that rule, you'll experience subtle scoring bugs
when your "boost" gets double-calculated.
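
To illustrate the hazard -- hypothetically, since today's code deliberately
avoids it -- if weighting mutated the Query in place, reusing the object
would compound the boost:

    /* Hypothetical hazard sketch: weighting that mutates the Query in
     * place.  norm_factor is a made-up per-corpus value for illustration. */
    float norm_factor = 0.25f;
    Query_Set_Boost(query, Query_Get_Boost(query) * norm_factor); /* first search  */
    /* ... the same Query object gets reused for a second search ... */
    Query_Set_Boost(query, Query_Get_Boost(query) * norm_factor); /* scaled twice! */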

> > We should probably formalize the notion that Lucy is not tied to any
> > particular query language.
> 
> Yes.  Also, there should be nothing magic about the standard parser.

Exactly.  Lucy::Search::QueryParser happens to implement one particular query
language, but that language is not the canonical interface to Lucy -- there
are other ways to specify search criteria.

> TF/IDF (I'm not actually against it, just having it define our 
> architecture) requires access to full collection statistics
> (Searcher), but can't this be done at Query creation or just after?
> 
> query = new Lucy::Query("this AND that");
> Lucy::TFIDF::Boost(query, Searcher);
> 
> query = new Custom::Query("este & ese");
> Custom::Boost(query, Searcher, IP, flags, whatever);
> 
> query = new One::Stop::Boosted::Query("user input", flags, boost_parameters);

Query objects are often created directly by a user.  We should not modify such
Queries by overwriting the user-supplied boost with a derived, corpus-weighted
boost.

Query objects may also be weighted in a PolySearcher and then passed down into
a child Searcher.  It is essential that the child Searcher know that weighting
has already been performed and must not be performed again.

Right now, weighting is performed during the Make_Compiler() phase.
Make_Compiler() always returns a new object.  Perhaps we might eliminate
Compiler and change the weighting phase to return a Query instead -- but it
should still always return a new object.  The returned value might be a clone
of the original Query, or it might be a wrapper which references it -- it
doesn't matter so long as we avoid modifying the original.

> The key for me is that there is a standard Query that gets created.

Right now, we check whether weighting has been performed by testing whether
the supplied Query subclasses Compiler:

    Compiler *compiler = Query_Is_A(query, COMPILER)
                       ? (Compiler*)INCREF(query)
                       : Query_Make_Compiler(query, (Searcher*)self, 
                                             Query_Get_Boost(query));

If we eliminate Compiler, we need to compensate by adding, say, a boolean
"weighted" attribute to Query so that we can tell when weighting is already
complete.

    Query *weighted_query = Query_Weighted(query)
                          ? (Query*)INCREF(query)
                          : Query_Weight(query, (Searcher*)self, 
                                         Query_Get_Boost(query));
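
For concreteness, a Query_Weight() implementation under that scheme might
look roughly like this -- the cloning strategy, the stats helper, and the
Set_Weighted() setter are all hypothetical:

    /* Hypothetical sketch; S_fold_in_corpus_stats() and Query_Set_Weighted()
     * are invented names.  The contract that matters: never mutate the
     * original, always hand back a new object with the flag set. */
    Query*
    Query_Weight(Query *self, Searcher *searcher, float boost) {
        Query *weighted = Query_Clone(self);        /* or a wrapper referencing self */
        S_fold_in_corpus_stats(weighted, searcher); /* IDF and friends               */
        Query_Set_Boost(weighted, boost);
        Query_Set_Weighted(weighted, true);         /* the proposed boolean flag     */
        return weighted;
    }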

> It can be a subclass, of course, but shouldn't be required to be
> one.   And once the (optional) boosting is done, it's standalone and
> ready to be passed off to a machine that does not have access to the
> full corpus statistics.

Aside from the subclass requirement, you've just described today's status quo
with Compiler.  :)

> > A Compiler/Investigation is created using the combination of a Query and a
> > Searcher.  It inspects the Query and performs weighting based on statistics
> > obtained from the Searcher.  Once the weighting is finished, it is ready to
> > perform its role as a factory which creates Matchers from supplied SegReaders.
> 
> Ditch it.  Or in keeping with the "no MatchEngine" approach, use it as
> an internal class to create a Matcher or MatcherFactory, but never
> pass a Compiler object around.  Matcher can have a reference to its
> parent Query if needed.

Hmm.  In order to avoid modifying the user-supplied Query, the output of the
weighting phase needs to be a new object, and we need to pass that object
around.  Are you OK with that so long as it's a Query?

> > The advantage of delegating to Compiler/Investigation is that Highlighter can
> > be made to work against arbitrary Query subclasses -- Highlighter itself
> > doesn't need to be extended.
> 
> Great advantage, let's keep it.  But once Query is pre-weighted this
> is solved, right?

Well, we can't just move this functionality up into TermQuery, PhraseQuery,
etc.  For that to work, our top-level Query classes would have to be tied to
scoring models -- unfortunately, just knowing "boost" isn't enough to score a
snippet.

> > A Matcher is a low-level object which operates against a single segment of an
> > index.  It determines which documents within that segment match the Query, and
> > optionally, calculates a score expressing how well each document matched.
> 
> Remind me:  how does the Matcher get associated with the segment?

The index segment is represented by a SegReader, which is a required argument
to Compiler's Make_Matcher() method:

    my $matcher = $compiler->make_matcher(reader => $seg_reader);
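
Conceptually, the Searcher's driver loop then creates one Matcher per
segment.  Apart from Make_Matcher() itself, the names in this sketch are
stand-ins rather than our actual internals:

    /* One Matcher per SegReader; collect_hits() is a hypothetical helper. */
    for (uint32_t i = 0; i < num_seg_readers; i++) {
        SegReader *seg_reader = seg_readers[i];
        Matcher   *matcher    = Compiler_Make_Matcher(compiler, seg_reader, need_score);
        if (matcher) {                             /* NULL => no possible hits here */
            collect_hits(matcher, collector, seg_reader);
            DECREF(matcher);
        }
    }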

> The problem is that right now there are lots of dependencies on those pieces
> that are "underneath" Matcher, and generally lots of interdependence between
> the pieces that make changing out any single piece really hard.

We improved things considerably during the round of refactoring that produced
Compiler in 2008, and I hope that this time we get close to the theoretical
ideal given our constraints.

> > Right now, Query creates Compiler and Compiler creates Matcher -- but that
> > means that in order to plug in an alternate scoring engine, you have to
> > subclass every Query class.
> 
> Which is clearly insanity, but is not actually due to
> Query->Compiler->Matcher.  Rather, it's the continued interdependence
> after each creates the other, in the sense that each builds on the
> other only a small amount.  But I think we can solve this quite easily,
> at least for TF/IDF, which will continue to be the default scoring,
> and this will give us a much easier path forward for alternative
> scoring arrangements.

Our various Compiler subclasses are not that small.  

    * They handle TF/IDF calculation and weighting propagation.
    * They handle generation of scoring Spans used by the Highlighter.
    * They perform certain optimizations when compiling a Matcher.  (For
      instance, a single-element ANDCompiler will return the bare inner
      Matcher rather than wrapping it in a single-element ANDMatcher --
      sketched just after this list.)
    * If/when we add "explain" functionality, Compiler (or whatever it
      becomes) will play a major role.
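
For the optimization bullet, the single-element collapse might look roughly
like this inside ANDCompiler's Make_Matcher() -- the child-matcher helper and
the ANDMatcher constructor arguments are illustrative, not exact signatures:

    /* Sketch of the single-element collapse; S_make_child_matchers() is an
     * invented helper and the ANDMatcher_new() arguments are approximate. */
    uint32_t  num_kids = 0;
    Matcher **kids     = S_make_child_matchers(self, reader, need_score, &num_kids);
    if (num_kids == 1) {
        Matcher *only = kids[0];   /* hand back the bare inner Matcher...          */
        free(kids);
        return only;               /* ...instead of a one-child ANDMatcher wrapper */
    }
    return (Matcher*)ANDMatcher_new(kids, num_kids);  /* takes ownership of kids */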

> But if the goal is to provide a better foundation, strip out TF/IDF and then
> put it back in a sane manner.   

I like the way you put that.  :)

> I think thinking in terms of Boosting a Query is a good approach, and I
> think it can be done with standard query fields rather than per scorer
> subclasses.

I don't think there's any way to avoid those subclasses.  The weighting phase,
for TF/IDF and other scoring models, requires that intermediate data be saved
in member variables beyond the standard query fields.

Brainstorming here... We could potentially have the TFIDF-specific classes be
more ephemeral.  They could live only long enough for us to complete the
weighting phase and calculate a floating point boost.  We would then clone the
original Query, assign the new boost (overwriting the user-supplied boost), set
its "weighted" attribute to "true", and then use this weighted query object as
a spec to generate Matchers.
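
To make that brainstorm concrete, the flow might go something like this --
TFIDFWeighter, Final_Boost(), and Set_Weighted() are invented names for
whatever the ephemeral machinery and the proposed flag turn out to be:

    /* Hypothetical flow; all TF/IDF-specific state dies with 'weighter'. */
    TFIDFWeighter *weighter     = TFIDFWeighter_new(query, searcher);
    float          corpus_boost = TFIDFWeighter_Final_Boost(weighter);
    DECREF(weighter);

    Query *weighted = Query_Clone(query);      /* never mutate the user's Query      */
    Query_Set_Boost(weighted, corpus_boost);   /* overwrites the user-supplied boost */
    Query_Set_Weighted(weighted, true);        /* the proposed "weighted" attribute  */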

However, if we do that, we can no longer have the weighted Query object
(currently a Compiler) serve as a factory for Matchers, because high-level
Query objects must not be tied to specific scoring models.  It will have to be
the Searcher instead which determines the mapping between Query and Matcher.  
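
To spell out what that means, the Searcher would need something like a
class-by-class dispatch.  Every helper name below is illustrative:

    /* Illustrative only: the Searcher picking a Matcher implementation per
     * Query class.  S_term_matcher() and S_phrase_matcher() are invented. */
    Matcher*
    Searcher_Query_To_Matcher(Searcher *self, Query *query, SegReader *reader) {
        if      (Query_Is_A(query, TERMQUERY))   { return S_term_matcher(self, query, reader);   }
        else if (Query_Is_A(query, PHRASEQUERY)) { return S_phrase_matcher(self, query, reader); }
        else    { THROW(ERR, "No Matcher mapping for this Query class"); }
        return NULL;  /* not reached */
    }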

I'm not sure that's better.  It hides the per-Matcher Query subclasses more
effectively, but it's artificially constraining.  Why must we limit the
information we generate during the weighting phase and provide to the Matcher
factory to a single floating point boost?  We need the per-Matcher Query
subclasses anyway.

Marvin Humphrey

