incubator-lucy-dev mailing list archives

From Nathan Kurz <n...@verse.com>
Subject Re: [lucy-dev] Who should invoke query weighting?
Date Fri, 15 Apr 2011 09:49:00 GMT
On Thu, Apr 14, 2011 at 1:21 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
> On Wed, Apr 13, 2011 at 09:01:51PM -0700, Nathan Kurz wrote:
>> On Wed, Apr 13, 2011 at 12:21 PM, Marvin Humphrey
>> <marvin@rectangular.com> wrote:
>> > Moving weighting out of the library and into application space would increase
>> > the complexity of user code, ipso facto:
>> >
>> >     my $query = $query_parser->parse($query_string);
>> > +    $query = $query->weight(searcher => $searcher);
>> >     my $hits = $searcher->hits(query => $query);
>>
>> It doesn't have to be in the application level --- I'd be perfectly
>> happy to have it happen in the query parser,
>
> Sorry, but I feel very strongly that QueryParser should not be given
> responsibility for query weighting.

As someone who is arguing that TF/IDF needs to be better
compartmentalized, I completely agree with you:  QueryParser should
not do anything that is specific to any particular scoring method.
But BM25QueryParser, or Lucy::TFIDF::Query, or whatever
scorer-specific subclasses we provide could do it.
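
To make the subclassing idea concrete, here is a minimal sketch of a
base parser that knows nothing about scoring and a scorer-specific
child that weights the query as it builds it.  It is written in Python
rather than against Lucy's Perl bindings, and every name in it (Query,
QueryParser.parse, TFIDFQueryParser, doc_freq, doc_count) is made up
for illustration; none of it is Lucy's actual API, and the smoothed
IDF formula is just one common variant.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Query:
    """A parsed query: its terms plus optional per-term weights."""
    terms: list
    weights: dict = field(default_factory=dict)


class QueryParser:
    """Base parser: needs no index and knows nothing about scoring."""

    def parse(self, query_string):
        return Query(terms=query_string.lower().split())


class TFIDFQueryParser(QueryParser):
    """Hypothetical scorer-specific subclass that also weights the query."""

    def __init__(self, doc_freq, doc_count):
        self.doc_freq = doc_freq      # term -> number of docs containing it
        self.doc_count = doc_count    # total number of docs in the corpus

    def parse(self, query_string):
        query = super().parse(query_string)
        for term in query.terms:
            df = self.doc_freq.get(term, 0)
            # Smoothed IDF, chosen only for illustration.
            query.weights[term] = 1.0 + math.log(self.doc_count / (df + 1))
        return query
```

The base class can still parse a query string with no index at hand;
only the subclass needs corpus statistics, and only it is tied to a
scoring model.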

> First, QueryParser doesn't have corpus statistics available to it.  It *can't*
> perform query weighting unless we start giving it access to a Searcher, which
> would be problematic.  (If QueryParser's constructor starts requiring a
> Searcher, you won't be able to parse a query string without access to an index,
> which is silly.  If the Searcher is added as an optional parameter, then
> QueryParser will have subtly different behavior depending on whether the
> Searcher was supplied, resulting in hard-to-debug scoring changes.)

I was thinking you could choose the parser you want for the purpose
you need to accomplish.  But maybe we're thinking differently, as you
talk of a QueryParser constructor.  I think of it as a Query
constructor, a function that happens to live in the QueryParser
package.

> Second, QueryParser should not be tied to a scoring model.

Maybe we view subclassing differently.   The general purpose stuff
goes in the base class, and the more specific stuff goes in the
subclasses.  Ideally the child extends the parent.  You prefer it the
other way around? :)

> Third, QueryParser is too big and has too much responsibility already.  It's
> also our most security sensitive class and we need to keep it locked down, not
> open it up.

Security is not something I had considered.  Yes, a concern, although
I see the subclassing approach as bypassing this.

> Finally, adding query weighting to QueryParser makes it more challenging to
> subclass.

I think having multiple examples of how to subclass it for different
weighting schemes would actually make things easier.   In general, I
think examples are great.  What would really solve our issues would be
for you to code up a couple of disparate weighting schemes to show how it
should be done.  If it turns out to be easy and clear, I'd be the
first to cheer your heroism and valor.

> I don't relish
> the prospect of inserting that weighting line into our tutorial documentation
> or explaining why it's necessary to a new user in a support email.

If that's the only obstacle, I could commit in advance to answering
all such questions as they pop up.  I'd even write that line in the
tutorial!

> I also dislike it that the user would get subtly degraded search results
> rather than catastrophic failure if they neglect to weight their query.

I recognize the issue, but I just don't see it as being a problem.
Again, I'm not arguing that it needs to be done manually in any way,
just that it should be encapsulated.  You could have base QueryParser
die with an error if you feel it's necessary.

> In my opinion, the natural choice for the invoker which controls when query
> weighting happens is Searcher.

You mean Compiler weighting, right?   It would be great if we could
call it a weighted query.  Or a Lucy::TFIDF::Query.  Because then we
wouldn't have to wonder why we need to send gcc and Clang across the
net every time we want search results from a remote machine.

> Searcher always has access to the corpus-wide statistics which are needed by
> various weighting algorithms.  Indeed, gathering together a corpus comprising
> one or more indexes across one or more machines is one of the main reasons we
> need a Searcher abstraction layer.

This is a reasonably persuasive argument.   I think the only reason it
doesn't convince me is that the weighting schemes that interest me
probably won't be accessed through Searcher.  And I find it
conceptually cleaner that the "query" includes the ordering details,
rather than having them added to an additional entity.   But as you
seem quite steadfast that Query is right as it is, no subclassing
unless they are called Compilers, I probably should give it up and
start figuring out how to make Searcher easier to work with.

> Additionally, if query weighting happens internally within Searcher, we can
> always know that the *right* corpus-wide statistics were used, whereas if
> weighting is done externally, errors are possible.

Or worse, you could even open the wrong index and find passages from
the constitution instead of your website content!

> It was an exercise in role reversal; I was trying to envision and depict a
> situation in which TF/IDF was a second class citizen.  Looks like I could have
> done a better job.  :\

You did a fine job.  Now just finish the role playing by writing a
dead simple static scoring class (one point per term) and showing me
how easy, readable, maintainable and self-contained it is.
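
For reference, the bar I have in mind is roughly this low.  The sketch
below is plain Python, not Lucy code: StaticScorer, score and rank are
hypothetical names, and I'm assuming "one point per term" means one
point for each distinct query term that appears in the document.

```python
class StaticScorer:
    """Scores a document one point per distinct query term it contains."""

    def score(self, query_terms, doc_terms):
        return len(set(query_terms) & set(doc_terms))


def rank(scorer, query_terms, docs):
    """Return (doc_id, score) pairs, best first, skipping non-matches."""
    scored = [
        (doc_id, scorer.score(query_terms, terms))
        for doc_id, terms in docs.items()
    ]
    matches = [pair for pair in scored if pair[1] > 0]
    # Sort by descending score, then by doc_id so ties are deterministic.
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))
```

Nothing here needs corpus-wide statistics, which is the point: a
scheme like this should not have to know that IDF exists.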

> We're fortunate to have you as a gadfly instead.

Yer project sucks!  But you're very sweet to humor me.  :)

I'll try to move on to a better topic.

--nate
