incubator-lucy-dev mailing list archives

From Nathan Kurz <n...@verse.com>
Subject Re: [lucy-dev] Refining Query-to-Matcher compilation
Date Tue, 12 Apr 2011 07:09:34 GMT
On Mon, Apr 11, 2011 at 9:07 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
> On Sun, Apr 10, 2011 at 12:08:05PM -0700, Nathan Kurz wrote:
>> > Query objects may also be weighted in a PolySearcher and then passed down into
>> > a child Searcher.  It is essential that the child Searcher know that weighting
>> > has already been performed and must not be performed again.
>>
>> I feel this is an architectural flaw, and that the correct solution is
>> that weighting should never be performed automatically.
>
> Unfortunately, I don't see how that could work.  Weighting isn't optional.
>
> The process of "weighting" a query under TF/IDF involves weighting subqueries
> using IDF, so that if you search for 'new york', the rare term 'york'
> contributes more towards the score than the common term 'new'.
>
> If you don't do that, your search results aren't going to be as relevant as
> they should be -- you're going to get too much 'new' and not enough 'york'.

Yes.  I wasn't trying to say that it shouldn't be weighted, but that
the weighting should be explicit rather than automatic.  I was
suggesting that instead of checking whether the weighting has already
been done, we provide a means for the weighting to be done and simply
require that it be used.  This is just from a general desire to make the
code paths as simple and explicit as they can be.
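
To make the distinction concrete, here is a minimal sketch in Python
rather than Lucy's C/Perl.  Every name in it (TermQuery,
WeightedTermQuery, weigh, search) is hypothetical and is not Lucy's
API, and the IDF formula is one common textbook form assumed for
illustration.  The point is only that the searcher never asks "has
this been weighted yet?"; it accepts nothing but the weighted type.

```python
import math


def idf(doc_freq, doc_count):
    # Assumed formula for illustration: rarer terms get larger weights.
    return 1.0 + math.log(doc_count / (doc_freq + 1))


class TermQuery:
    """Unweighted, user-facing query (hypothetical)."""

    def __init__(self, term):
        self.term = term


class WeightedTermQuery:
    """Result of the explicit weighting step (hypothetical)."""

    def __init__(self, term, weight):
        self.term = term
        self.weight = weight


def weigh(query, doc_freqs, doc_count):
    # The one explicit place where corpus statistics are applied.
    return WeightedTermQuery(query.term, idf(doc_freqs.get(query.term, 0), doc_count))


def search(weighted_query):
    # No "already weighted?" flag: unweighted input is simply rejected.
    if not isinstance(weighted_query, WeightedTermQuery):
        raise TypeError("call weigh() before search()")
    return weighted_query.weight
```

With corpus statistics such as {"new": 900, "york": 9} over 1000
documents, weigh() gives 'york' a larger weight than 'new', and a
parent searcher can call weigh() once and hand the result to its
children.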

>> Assume there is a machine with a known index schema and a net connection:
>> exactly what do we need to specify over-the-wire to get the results we want?
>
> We need to send over a serialized Query which has already been weighted using
> aggregate statistics for the entire corpus.
>
> Right now, that means the Query must be a Lucy::Search::Compiler object.

This is sad, but a lot of my difficulties might be purely semantic.  I
have trouble with Compiler being a subclass of Query, and am only
starting to understand what you meant by "High Level Query" and "Low
Level Query" in some earlier mail.  And because of some earlier phrasing
about "serializing the Query" I just wasn't seeing that it was
actually a Compiler.  I thought there was yet another entity involved.

I think it's the combination of wrapping a Query and being a Query
that confuses me.

So Compiler inherits from Query (and thus is a "low level query"?),
but TermCompiler does not inherit from TermQuery?  I guess it's that
I want them to either always be subclasses or never be, but I'm uneasy
about the halfway state.  I feel like Compiler is trying to do an awful
lot of things, few of which are really reflected in its name or parentage.
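
The shape being described can be sketched as follows.  This is a
Python illustration of the relationships as stated in this thread
(Compiler is a Query and also wraps one; TermCompiler is not a
TermQuery), not Lucy's actual class definitions.  The constructor
arguments and the "parent" attribute name are assumptions.

```python
class Query:
    def __init__(self, boost=1.0):
        self.boost = boost


class TermQuery(Query):
    def __init__(self, field, term, boost=1.0):
        super().__init__(boost)
        self.field = field
        self.term = term


class Compiler(Query):
    """Is-a Query, and also has-a Query (the one it was built from)."""

    def __init__(self, parent):
        super().__init__(parent.boost)
        self.parent = parent  # the wrapped "high level" query


class TermCompiler(Compiler):
    """Subclass of Compiler, but not of TermQuery."""


tc = TermCompiler(TermQuery("content", "york"))
```

Here isinstance(tc, Query) is true while isinstance(tc, TermQuery) is
false, so the term-specific data is reachable only through tc.parent.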

And what would a non-TF/IDF-specific form of Lucy::Index::Similarity be called?

> What I think you may be missing is that we need ANDCompiler and TermCompiler
> in order to *calculate* the values that you would have us insert into ANDQuery
> and TermQuery.  The complex code that performs TF/IDF weighting has to go
> *somewhere* -- TermCompiler and ANDCompiler are that "somewhere".  Even if we
> were to stop using them as containers, we can't kill them off.

I'm missing a lot, but that one I'm getting.  My reference to the
nonexistent "Scorer" is my attempt to find a proper place for it,
where "proper" is just about anywhere with a clearly delineated
boundary.  I know this doesn't currently exist, but your MatchEngine
and Lucy::Score::TFIDF* hierarchy feels like a good direction to
explore.

My latest mental failures have been trying to figure out how to
shoehorn in geographic distance subqueries.  Should be simple, right?

--nate
