lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject [lucy-dev] Who should invoke query weighting?
Date Thu, 14 Apr 2011 20:21:28 GMT

On Wed, Apr 13, 2011 at 09:01:51PM -0700, Nathan Kurz wrote:
> On Wed, Apr 13, 2011 at 12:21 PM, Marvin Humphrey
> <marvin@rectangular.com> wrote:
> > Moving weighting out of the library and into application space would increase
> > the complexity of user code, ipso facto:
> >
> >     my $query = $query_parser->parse($query_string);
> > +    $query = $query->weight(searcher => $searcher);
> >     my $hits = $searcher->hits(query => $query);
> 
> It doesn't have to be in the application level --- I'd be perfectly
> happy to have it happen in the query parser, 

Sorry, but I feel very strongly that QueryParser should not be given
responsibility for query weighting.

First, QueryParser doesn't have corpus statistics available to it.  It *can't*
perform query weighting unless we start giving it access to a Searcher, which
would be problematic.  (If QueryParser's constructor starts requiring a
Searcher, you won't be able to parse a query string without access to an index,
which is silly.  If the Searcher is added as an optional parameter, then
QueryParser will have subtly different behavior depending on whether the
Searcher was supplied, resulting in hard-to-debug scoring changes.)
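
To make the first point concrete, here is a minimal sketch in plain Python
(deliberately not Lucy's API, so nothing here should be read as how Lucy
implements it).  The names parse(), idf() and weight(), and the textbook IDF
formula, are assumptions for illustration only.  Parsing needs nothing but the
query string, while weighting cannot run without corpus statistics:

```python
import math

def parse(query_string):
    """Toy parser (hypothetical): needs only the string, never an index."""
    return query_string.lower().split()

def idf(doc_freq, doc_count):
    """Textbook-style IDF; assumed here, not taken from Lucy's source."""
    return 1.0 + math.log(doc_count / (1 + doc_freq))

def weight(terms, doc_freqs, doc_count):
    """Weighting requires corpus statistics that a parser does not have."""
    return {t: idf(doc_freqs.get(t, 0), doc_count) for t in terms}

terms = parse("rare common")

# The same parsed query gets different weights against different corpora.
corpus_a = weight(terms, {"rare": 1, "common": 90}, 100)
corpus_b = weight(terms, {"rare": 50, "common": 50}, 100)
```

In this toy model, "rare" outweighs "common" against the first corpus but not
against the second, which is why the parsed query alone cannot determine the
weights.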

Second, QueryParser should not be tied to a scoring model.

Third, QueryParser is too big and has too much responsibility already.  It's
also our most security-sensitive class and we need to keep it locked down, not
open it up.

Finally, adding query weighting to QueryParser makes it more challenging to
subclass.  We've spent a lot of time making it easier to hack up a custom
query parser, which is very important because everybody hates everybody else's
query parser and our pressure release valve is being able to say "you can roll
your own!"  It would be counterproductive to make subclassing QueryParser more
difficult by complicating its role and increasing its cognitive burden.

> > For TF/IDF, queries should *always* be weighted, so if we made this change
> > the user would simply become responsible for manually executing a step
> > that Lucy performs automatically right now.
> 
> Sure, but so long as the rules are clear it isn't that onerous.  The
> reality is that most new users are going to cut and paste from your
> sample program, and so long as the sample includes this line they are
> unlikely to go out of their way to remove it.

I strongly disagree here, too, I'm afraid.  I think adding a hard-to-grok
manual weighting step would be suboptimal interface design.  I don't relish
the prospect of inserting that weighting line into our tutorial documentation
or explaining why it's necessary to a new user in a support email.  

I also dislike that the user would get subtly degraded search results
rather than catastrophic failure if they neglect to weight their query.  Query
weighting should happen automatically, and it is worth increasing the
complexity of the library code somewhat to make sure that it happens right
100% of the time.
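
A toy illustration of that failure mode, again in plain Python rather than
Lucy's API.  The documents, the statistics, and the assumption that an
"unweighted" query behaves as if every term had weight 1.0 are all made up for
the example.  Nothing raises an error when the weighting step is skipped; the
ranking just quietly changes:

```python
import math

def idf(doc_freq, doc_count):
    """Textbook-style IDF; assumed here, not taken from Lucy's source."""
    return 1.0 + math.log(doc_count / (1 + doc_freq))

def rank(docs, weights):
    """Score each doc as sum(tf * weight) and return names, best first."""
    scores = {
        name: sum(tf.get(term, 0) * w for term, w in weights.items())
        for name, tf in docs.items()
    }
    return sorted(scores, key=lambda name: -scores[name])

# Term frequencies for two hypothetical documents.
docs = {
    "stopword_heavy": {"the": 4},
    "on_topic": {"the": 1, "lucy": 1},
}

# Forgetting to weight: every term counts the same, and no error is raised.
unweighted = {"the": 1.0, "lucy": 1.0}

# Weighted against made-up statistics for a 1000-document corpus.
weighted = {"the": idf(999, 1000), "lucy": idf(9, 1000)}

unweighted_order = rank(docs, unweighted)
weighted_order = rank(docs, weighted)
```

Under these assumptions the unweighted query ranks the stopword-heavy document
first, while the weighted query ranks the on-topic document first.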

In my opinion, the natural choice of invoker, i.e. the component that controls
when query weighting happens, is Searcher.

Searcher always has access to the corpus-wide statistics which are needed by
various weighting algorithms.  Indeed, gathering together a corpus comprising
one or more indexes across one or more machines is one of the main reasons we
need a Searcher abstraction layer.

Additionally, if query weighting happens internally within Searcher, we can
always know that the *right* corpus-wide statistics were used, whereas if
weighting is done externally, errors are possible.
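
As a rough sketch of that arrangement (plain Python; ToySearcher and its
methods are hypothetical and are not Lucy's Searcher API), the searcher below
owns the corpus statistics and weights every query inside hits(), so the
caller can neither skip the step nor weight against the wrong corpus:

```python
import math
from collections import Counter

class ToySearcher:
    """Hypothetical searcher that owns its corpus statistics."""

    def __init__(self, docs):
        self.docs = [doc.lower().split() for doc in docs]
        # Document frequency: the number of docs containing each term.
        self.doc_freq = Counter(t for doc in self.docs for t in set(doc))

    def _weight(self, terms):
        # Always computed from this searcher's own statistics.
        n = len(self.docs)
        return {t: 1.0 + math.log(n / (1 + self.doc_freq[t])) for t in terms}

    def hits(self, terms):
        """Weight internally, then return matching doc ids, best first."""
        weights = self._weight(terms)
        scored = []
        for doc_id, doc in enumerate(self.docs):
            tf = Counter(doc)
            score = sum(tf[t] * w for t, w in weights.items())
            if score > 0:
                scored.append((score, doc_id))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for _, doc_id in scored]

searcher = ToySearcher(["the the", "the lucy", "the cat", "the dog"])
result = searcher.hits(["the", "lucy"])
```

The caller passes only parsed terms; there is no separate weight() call to
forget, and no way to hand in weights computed from some other index.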

There's still the question of which class should control what weighting
actually gets done -- Compiler, MatchEngine, etc. -- but regardless, I think
it makes sense for query weighting to happen within the scope of a Searcher.

> > The proposal makes perfect sense, though, if scoring isn't important to you.
> 
> Or if scoring is very important to you.  

Haha, touché!

I hope that the end result of this round of refactoring accommodates power
users without making life more difficult for casual users.

> It makes less sense if what you want is an out-of-the-box no configuration
> search box for your text based web site.

Scoring is very important to our casual users, even if they don't understand
the gory details of how it works.

> > If all that were true, Lucy's internals could be simplified considerably.  All
> > of the weighting code would be gone -- we wouldn't have to think about it in
> > either single-node or search-cluster context.  Lucy::Search::Compiler would be
> > gone and we would all just pass around Query objects.  Only the TF/IDF weirdos
> > would stuff those bizarre calls to $query->weight into their application
> > code...
> 
> I can't quite tell how much I'm being mocked here.  I'm guessing you're
> trying your best to express a point of view that you don't quite
> share.

It was an exercise in role reversal; I was trying to envision and depict a
situation in which TF/IDF was a second class citizen.  Looks like I could have
done a better job.  :\

> No offense in either case, though, as I'm sure many things I
> suggest are quite deserving of considerable mockery.

On a lot of lists it takes some dumbass bomb thrower to barge in and bellow
"yer project sucks!" to get a lively discussion going.

We're fortunate to have you as a gadfly instead.

Cheers,

Marvin Humphrey

