incubator-lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] Dependency injection and customizing a scorer
Date Mon, 23 May 2011 02:59:35 GMT
On Thu, May 19, 2011 at 01:15:47PM -0700, Nathan Kurz wrote:
> My concern is how the particular implementation of ORScorer is selected by
> the levels above it.  Currently, the entire hierarchy is hard-coded, so any
> changes require subclassing all the way from the top down to the point of
> change.

That's a nice, succinct description of the problem. :)

I agree that we want to perform dependency injection *somewhere* to map
between Query and Matcher classes.  Query classes should be decoupled from
matching/scoring, as should QueryParser.  

If we remove the hard-coded constructor calls from various Make_Compiler()
implementations, so that Queries no longer know how to spawn Compiler objects
-- and by extension, spawn Matchers -- *something* has to take over and assume
responsibility for deciding how to create a Compiler from a Query.  The
question is *where* the dependency injection should occur.

Here are some options -- not all of them good:

    1. Within each Query subclass, via per-class global variables.
    2. Within each Query object, via a CompilerFactory member variable.
    3. At the per-index level, via a MatchEngine which is a property of the
       Schema.
    4. At the Clownfish class loader level, by abusing the VTable_registry.

/***************** Option 1: Per-class global variables. *******************/

The first option, using per-class global variables, is easy to understand.
Pretending for a moment that ORQuery is coded in Perl, here's how we'd set
things up:

    package Lucy::Search::ORQuery
    use base qw( Lucy::Search::Query );

    my $compiler_class = "Lucy::Search::ORCompiler";
    sub set_compiler_class { $compiler_class = $_[1] }

    sub make_compiler {
        my $self = shift;
        return $compiler_class->new(@_);
    }

Registering your preferred Compiler subclass with ORQuery causes it to be used
*everywhere*:

    Lucy::Search::ORQuery->set_compiler_class("MyORCompiler");

The problem with this approach is that it is global.  Two different scoring
mechanisms cannot coexist within the same process.  Non-constant global
variables cause action-at-a-distance bugs:

    http://en.wikipedia.org/wiki/Action_at_a_distance_%28computer_science%29

I would oppose using non-constant globals this way in Lucy just as strenuously
as I opposed such usage in Lucene:

    http://s.apache.org/ns

    To be clear, I consider the idea of a settable global variable determining
    library behavior completely unacceptable. Changing class load order
    somewhere in your code shouldn't do things like change search results
    (because Stopfilters are applied differently depending on who "won").

/************** Option 2: Per-Query-object CompilerFactories ***************/

Our second option, "manual" dependency injection at the object level using the
"abstract factory" design pattern[1], does not suffer from the same defects --
though it has other problems.

Right now, in order to get a Compiler object out of a Query, you invoke the
Query's Make_Compiler() method.

    Compiler *compiler = Query_Make_Compiler(query, searcher, boost);

In the case of an ORQuery, the output will *always* be a
Lucy::Search::ORCompiler, because the call to ORCompiler_new() is hard-coded.
But if we give ORQuery a CompilerFactory member variable and have
Make_Compiler() delegate to it, it's no longer necessary to subclass ORQuery:

 Compiler*
 ORQuery_make_compiler(ORQuery *self, Searcher *searcher, float boost) {
-    return (Compiler*)ORCompiler_new(self, searcher, boost);
+    return CompilerFactory_Make_Compiler(self->compiler_factory, self,
+                                         searcher, boost);
 }


To deploy your custom Compiler subclass, you could theoretically touch every
ORQuery individually:

    package MyORCompilerFactory;
    use base qw( Lucy::Search::CompilerFactory );

    sub make_compiler {
        my $self = shift;
        return MyORCompiler->new(@_);
    }

    ...

    # Note that we use an ordinary ORQuery here, but modify it by setting a
    # custom CompilerFactory.
    my $or_query = Lucy::Search::ORQuery->new(%args);
    $or_query->set_compiler_factory(MyORCompilerFactory->new);

More likely, you would want to subclass QueryParser and have it do the work of
setting the CompilerFactory for each query object.

    package MyQueryParser;
    use base qw( Lucy::Search::QueryParser );

    sub make_or_query {
        my $self = shift;
        my $or_query = Lucy::Search::ORQuery->new(@_);
        $or_query->set_compiler_factory(MyORCompilerFactory->new);
        return $or_query;
    }

    ... 

    my $query_parser = MyQueryParser->new(schema => $searcher->get_schema);
    my $or_query = $query_parser->parse("foo OR bar");

There are at least two problems with this approach, though.

First, QueryParser should be decoupled from scoring.  QueryParser has
*plenty* of responsibility already -- we should not add to its burden.

Second, doing things this way requires client code to get a lot of little
things right in a lot of places.  Every time anybody creates an ORQuery,
set_compiler_factory() has to be called with the right value or you'll have
bugs.  It's not good API design for Query if we force client code to take on
that responsibility.

/******************** Option 3: Per-index MatchEngine. *********************/

The third option, a per-index MatchEngine[2], avoids all the problems we've
seen so far.  It's not global, yet the association of Query type and
CompilerFactory only happens in one place:

    package MyMatchEngine;
    use base qw( Lucy::Plan::MatchEngine );

    sub new {
        my $self = shift->SUPER::new(@_);
        $self->register_compiler_factory(
            query_type => "Lucy::Search::ORQuery"
            factory    => MyORCompilerFactory->new,
        );
        return $self;
    }

MatchEngine would be a property of the Schema[3].

    my $schema = Lucy::Plan::Schema->new;
    $schema->spec_field(name => 'title',   type => $text_type);
    $schema->spec_field(name => 'content', type => $text_type);
    $schema->set_match_engine(MyMatchEngine->new);
    ...

The *only* client code that has to "know" about the MatchEngine is the Schema
creation code.  All search-time client code will do the right thing
automatically, regardless of what MatchEngine is associated with a given
index.

    my $ordinary_searcher = Lucy::Search::IndexSearcher->new(index => $path);
    my $special_searcher  = Lucy::Search::Searcher->new(
        index => $path_to_index_with_schema_specifying_custom_match_engine,
    );

    # Use the same plain old ORQuery with both indexes.
    my $or_query = Lucy::Search::ORQuery->new(
        children => [
            Lucy::Search::TermQuery->new(field => 'title', term => 'foo'),
            Lucy::Search::TermQuery->new(field => 'title', term => 'bar'),
        ],
    );
    my $ordinary_hits = $ordinary_searcher->hits(query => $or_query);
    my $special_hits  = $special_searcher->hits( query => $or_query);

> > Writing extensible classes which are designed to be subclassed by users is a
> > very difficult computer science problem in general.
> 
> Yes, it's a hard problem, but over the last 10 years or so there has
> been a lot of progress on best practices under the general rubric of
> "Dependency Injection".      Particularly for a hybrid of compiled and
> interpreted code, we need a runtime means of choosing which
> implementation will perform a given service according to a fixed
> interface.  I think it's worth debating the means by which to solve
> this problem, but  we definitely benefit by improving the current
> approach.   But again, the problem isn't the ease of subclassing
> (which is great) rather the inability to substitute dependencies.

Is solving the dependency-substitution problem at the index level good enough?

I believe it will be possible to allow search-time overriding of an index's
default MatchEngine via IndexSearcher:

    my $searcher = Lucy::Search::IndexSearcher->new(index => $path_to_index);
    $searcher->set_match_engine(MyOtherMatchEngine->new);
    my $hits = $searcher->hits(query => 'foo OR bar');

Put another way: you won't have to regenerate your index in order to try out a
different MatchEngine.

/***************** Option 4: Abuse the class loader. ********************/
 
> > It would be a Bad Idea to replace an entry in the ORMATCHER VTable.  VTable
> > objects, once exposed, must not be modified in any way.  Instead, it's
> > important to create a *copy* of ORMATCHER using VTable_singleton() and replace
> > an entry in that.
> 
> It doesn't really matter much to me whether individual functions
> within ORMATCHER can be changed, or whether the entire entry needs to
> be swapped out and replaced.  What is important is that there be some
> stage at which the "label" ORMATCHER can safely be changed to point to
> a different "entry".  I'm hoping there is room for a setup stage where
> this is possible before the VTable is "exposed".
 
I included this proposal as the last option in the list of dependency
injection techniques above.  The idea is to support dependency injection by
allowing changes to the class-name-to-vtable mapping in the VTable_registry,
or possibly changes to the VTables themselves.

The fundamental problem is that VTables are globally visible quasi-singleton
variables, referenced by many other objects.  If we allow them to be mutable,
we end up with classic global variable problems: susceptibility to
action-at-a-distance behaviors, inability to support multiple configurations
simultaneously, etc.

For the sake of argument, this approach could be hacked into place by changing
VTable_registry back into a Hash instead of a LockFreeRegistry.  (Hash allows
entries to be replaced, while LockFreeRegistry does not.)  You could then
write the routines in VTable.c to allow arbitrary new association of class
names to VTables at any time.

But changing entries in the VTable_registry is analogous to taking an entire
compiled package in Perl and swapping it out for another one.  It's a scary,
hard-to-control technique more useful as an intellectual experiment than as a
practical device.

> I would like to know more where the badness comes from, though.

Using non-constant globals to determine behavior can work in small-scale
applications where everything about the everything is known to the developer.
But our shared VTable mechanism has to be able to work within large-scale
systems where the the developer might not know that some subcomponent of a
second-order dependency happens to be using Lucy.  (The Wikipedia article on
action-at-a-distance cited above does a good job of explaining why
non-constant global variables suck.)

That's the general critique.  In the specific case, our implementation of
VTable is thoroughly enmeshed with the host language's class hierarchy and
object system, and I cannot imagine a practical public API which enables the 
manipulation of our OO underpinnings this way.

/************************** [end of options] ******************************/

> My goal would be to make it possible to switch to a different ORScorer
> (again, arbitrary example) at runtime without having to fully understand the
> full class hierarchy or Clownfish.  By slightly repurposing the existing
> Registry, I think this would be possible quite easily.

Think of the MatchEngine proposal as "adding another registry".  :)

> The only other thing required is to convince you that it's worth moving what
> I feel is the current best practice of avoiding static dependencies between
> classes.  :)

+1

:)

This has been a long message, even for me, and I wonder how many lucy-dev
subscribers will make it all the way to the bottom.  But even if only Nate
reads all the way through, composing it has been a useful exercise: I now feel
ready to code up a prototype implementation of MatchEngine.

Marvin Humphrey


[1] There are 5 "creational patterns" described in the comp-sci classic
    "Design Patterns".  The classes described in this email use all 5:
        Abstract Factory: CompilerFactory.
        Factory Method:   Query_Make_Compiler(), Compiler_Make_Matcher(), etc.
        Builder:          MatchEngine.
        Singleton:        VTables are per-class singletons.
        Prototype:        Child VTables are cloned from parent VTables.
    <http://en.wikipedia.org/wiki/Design_Patterns#Creational_patterns>

[2] MatchEngine was previously described at <http://s.apache.org/W46>.

[3] I actually think MatchEngine should default based on the Architecture, so
    I'm lying a little when I say it should be a property of the Schema.  But
    Architecture is a property of the Schema, so from a high level it looks
    roughly the same.


Mime
View raw message