incubator-lucy-dev mailing list archives

From Nathan Kurz <n...@verse.com>
Subject Re: [lucy-dev] Refining Query-to-Matcher compilation
Date Mon, 11 Apr 2011 07:22:55 GMT
On Sun, Apr 10, 2011 at 1:01 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
> On Sat, Apr 09, 2011 at 11:59:11PM -0700, Nathan Kurz wrote:
> Say that you have two documents: "foo bar" and "foo foo foo bar".  The
> document frequency for "foo" is 2, but its corpus frequency is 4.

Yes, that was what I thought the paper said.  Thanks.
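
For anyone following along, the distinction can be checked with a few lines of Python. This is only an illustrative sketch: the function name `term_frequencies` and the whitespace tokenization are assumptions made here, not anything taken from Lucy.

```python
from collections import Counter

def term_frequencies(docs):
    """Return (document frequency, corpus frequency) for whitespace-split docs."""
    doc_freq = Counter()
    corpus_freq = Counter()
    for doc in docs:
        tokens = doc.split()
        corpus_freq.update(tokens)     # every occurrence counts
        doc_freq.update(set(tokens))   # each document counts once per term
    return doc_freq, corpus_freq

doc_freq, corpus_freq = term_frequencies(["foo bar", "foo foo foo bar"])
print(doc_freq["foo"], corpus_freq["foo"])  # 2 4
```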

>> Are there other scoring methods that you anticipate as useful?
>
> I've long wanted to work up an RTree implementation.

An answer which highlights our odd relationship between scoring and
index format, but I'd like to see spatial search added as well.

>> > When weighting an arbitrarily complex query, we have to allow the scoring
>> > model the option of having member variables and methods which perform the
>> > weighting, and we have to allow for the possibility that it will proceed in an
>> > arbitrary number of stages, requiring gradual modifications to complex
>> > internal states before collapsing down to a final "weight" -- if it ever does.
>>
>> Does your "if ever" imply that we indeed should try to support scorers
>> that might return additional information beyond a single float, such
>> as field name, position data, or matched string?
>
> That's the gist.
>
> We start in the top-level Searcher with a Query object, either supplied
> directly by the user or implicitly parsed from their supplied query string.
> We then weight the Query, producing a new Query with its "weighted" attribute
> set to "true".

Modulo the question of whether we do it or the user does, and
hopefully avoiding the 'attribute' approach, this sounds great.

> The weighted Query may serve as a vessel for arbitrary information augmenting
> the original Query.  It would be artificially constraining for us to limit the
> auxiliary data to a single float.

Yes, I think a single float would probably be too constraining.  But
I'm wondering if we could come up with a very short list of variables
that would work for the majority of scoring needs, and then have an
extension mechanism that could work for the rest.  This extension
could be a subclassed query, or could just be the equivalent of a void
* which a particular Matcher/Scorer knows how to handle.
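
One possible shape for that idea, sketched in Python rather than C. Every name here (`WeightedQuery`, `field_name`, `extension`, `proximity_boost`) is hypothetical and chosen only for illustration. The `extension` slot plays the role of the void *: only the scorer that put the payload there knows how to read it.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class WeightedQuery:
    # Hypothetical short list of "standard" variables.
    query: str
    weight: float = 1.0
    field_name: Optional[str] = None
    # Opaque extension slot, the analogue of a void *.
    extension: Any = None

def simple_score(wq, term_freq):
    """Scorer that relies only on the standard variables."""
    return wq.weight * term_freq

def proximity_score(wq, term_freq):
    """Custom scorer that also understands its own extension payload."""
    ext = wq.extension or {}
    return wq.weight * term_freq * ext.get("proximity_boost", 1.0)

plain = WeightedQuery("foo", weight=2.0)
boosted = WeightedQuery("foo bar", weight=2.0,
                        extension={"proximity_boost": 1.5})
```

A scorer that ignores the extension keeps working on any weighted query, while a custom one falls back to the standard variables when the payload is absent.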

>> I'd like to be able to do this, but don't see an easy framework.
>
> We're doing it now -- the Compiler object is our weighted Query.

I guess I meant:  I don't see an easy framework that allows someone to
use an alternate scoring system without getting intimate with the
entire system.  And that I don't see a way to make that scoring system
work across different index formats.

>> Also, do you feel a Scorer needs to be able to do "incremental"
>> scoring, or is it OK if scoring is only possible after a Matcher has
>> finished?
>
> (Nit: we don't have a Scorer class any more -- Matcher replaced it.)

Sorry for my combination of lack of clarity and befuddlement.  I guess
what I'm searching for is a way to cleave apart the Matcher from the
Scorer, so that Matchers end up Index specific but Scorers are not.  So
that when one wants to write a custom scorer, one needs only to
understand Matcher rather than the underlying data format.  Ideally,
one would be able to add an index format and have existing scorers
work without modification, and add a scorer that works across a
variety of formats.
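
A rough Python sketch of the separation described above, under heavy assumptions: `MatchRecord`, both matcher classes, and `tf_sum_scorer` are invented for this example and do not correspond to existing Lucy classes. Two matchers read two different storage layouts but emit the same record type, so one scorer works with either.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class MatchRecord:
    """Hypothetical index-independent description of one matching doc."""
    doc_id: int
    term_freqs: Dict[str, int]  # occurrences of each query term

class InvertedIndexMatcher:
    """Index specific: reads postings shaped as term -> {doc_id: freq}."""
    def __init__(self, postings):
        self.postings = postings

    def matches(self, terms):
        doc_ids = set()
        for term in terms:
            doc_ids.update(self.postings.get(term, {}))
        for doc_id in sorted(doc_ids):
            freqs = {t: self.postings.get(t, {}).get(doc_id, 0)
                     for t in terms}
            yield MatchRecord(doc_id, freqs)

class ForwardIndexMatcher:
    """Index specific: scans stored token lists, doc_id -> [tokens]."""
    def __init__(self, docs):
        self.docs = docs

    def matches(self, terms):
        for doc_id in sorted(self.docs):
            freqs = {t: self.docs[doc_id].count(t) for t in terms}
            if any(freqs.values()):
                yield MatchRecord(doc_id, freqs)

def tf_sum_scorer(record):
    """Format-independent scorer: it sees only a MatchRecord."""
    return float(sum(record.term_freqs.values()))

def search(matcher, scorer, terms):
    return [(r.doc_id, scorer(r)) for r in matcher.matches(terms)]

postings = {"foo": {1: 1, 2: 3}, "bar": {1: 1, 2: 1}}
docs = {1: ["foo", "bar"], 2: ["foo", "foo", "foo", "bar"]}
hits_inverted = search(InvertedIndexMatcher(postings), tf_sum_scorer, ["foo"])
hits_forward = search(ForwardIndexMatcher(docs), tf_sum_scorer, ["foo"])
```

Adding a third index format would mean writing one more matcher that yields `MatchRecord` objects; the scorer would not change.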

>> Essentially, will it ever be necessary to score a subquery so that a Matcher
>> can decide whether to skip to the next document?
>
> I'm not sure I grok completely, but I think the answer is yes.

Terrible phrasing on my part, but what I'm asking is whether it is a
requirement that we score incrementally, or just a contingency of our
current approach.  If we could come up with an index specific Matcher
that defined all the standard query types (And, Or, Term), could it
pass off to a Scorer at the end?

> Instead, you have to see whether "foo" matches and calculate a score
> incidentally, then see whether "bar" matches and calculate a score
> incidentally, and then if both match, you calculate an aggregate score using
> the scores for the "foo" and "bar" subqueries.

And no hope for a binary Yes/No (with allowed false positives) that
determines roughly whether a document deserves to be scored and puts
together something in an Index independent "standard" format that
a Scorer can read?  Because if this could be done efficiently, it
seems like the way out for having custom scoring interacting with
custom back ends without extreme duplication of effort.
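
To make the question concrete, here is a minimal two-phase sketch in Python. It is an assumption-laden illustration, not a claim about how Lucy works or could work efficiently: the bit-signature filter, the tiny `NUM_BITS`, and all function names are made up for this example. Phase 1 answers yes/no for an AND query and may admit false positives; phase 2 builds a per-document term-frequency record and scores it, dropping any false positives.

```python
import zlib

NUM_BITS = 8  # deliberately tiny, so collisions (false positives) can occur

def bit_for(term):
    # crc32 is used because it is deterministic across runs.
    return 1 << (zlib.crc32(term.encode("utf-8")) % NUM_BITS)

def signature(tokens):
    sig = 0
    for token in tokens:
        sig |= bit_for(token)
    return sig

def candidates(signatures, terms):
    """Phase 1: cheap yes/no test for an AND query.

    May return false positives, never false negatives.
    """
    wanted = signature(terms)
    return [doc_id for doc_id, sig in sorted(signatures.items())
            if (sig & wanted) == wanted]

def score_candidates(docs, doc_ids, terms):
    """Phase 2: build a term-frequency record per candidate and score it."""
    results = []
    for doc_id in doc_ids:
        freqs = {t: docs[doc_id].count(t) for t in terms}
        if all(freqs.values()):  # false positives are dropped here
            results.append((doc_id, float(sum(freqs.values()))))
    return results

docs = {1: ["foo", "bar"], 2: ["foo", "foo", "foo", "bar"], 3: ["baz"]}
signatures = {doc_id: signature(tokens) for doc_id, tokens in docs.items()}
cands = candidates(signatures, ["foo", "bar"])
hits = score_candidates(docs, cands, ["foo", "bar"])
```

Whether a filter like this can be made cheap enough on a real index is exactly the open question; the sketch only shows that the scorer never needs to know how the candidates were found.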

--nate
