lucy-dev mailing list archives

From Nathan Kurz <n...@verse.com>
Subject Re: [lucy-dev] Refining Query-to-Matcher compilation
Date Sun, 10 Apr 2011 19:08:05 GMT
I'm going to try to chip off some small pieces and deal with them
individually.   As a result, I may have a number of threads going at
once.  Sorry for the profusion, but I'll try to get back to the big
picture by the end.

On Thu, Apr 7, 2011 at 4:29 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
> Exactly.  Lucy::Search::QueryParser happens to implement one particular query
> language, but that language is not the canonical interface to Lucy -- there
> are other ways to specify search criteria.

Yes.  What I like about this is that we provide the user a tool
(QueryParser) to convert text into a Query, but that we don't require
them to use it.  If they want to create a conforming query by some
other means, they are welcome to do so.

Equally, if they want to start with a QueryParser-generated Query and
adjust it, for example by adding an optimization pass, they can do so.
Rather than passing in plain text and having the Query hidden in the
innards of Lucy, we expose the Query.
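
To make that concrete, here's a rough Perl sketch of both paths.  The
IndexSearcher/QueryParser calls follow the KinoSearch-style bindings
Lucy inherits, but My::Optimizer::rewrite() is a made-up stand-in for
whatever pass the user wants to run:

use Lucy::Search::IndexSearcher;
use Lucy::Search::QueryParser;

my $searcher = Lucy::Search::IndexSearcher->new( index => '/path/to/index' );
my $parser   = Lucy::Search::QueryParser->new( schema => $searcher->get_schema );

# The provided tool: text in, Query out.
my $query = $parser->parse('this AND that');

# But the Query is exposed, so the user can rewrite or replace it.
# My::Optimizer::rewrite() is hypothetical, standing in for any user pass.
$query = My::Optimizer::rewrite($query);

my $hits = $searcher->hits( query => $query, num_wanted => 10 );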

What I'm suggesting is that we do the same for scoring --- rather than
giving the user lots of knobs to tweak that affect how scoring happens
inside a monolithic method, I want this to happen out in the open.  We
provide tools to do it easily, but other tools can be used.

>> TF/IDF (I'm not actually against it, just having it define our
>> architecture) requires access to full collection statistics
>> (Searcher), but can't this be done at Query creation or just after?
>>
>> query = new Lucy::Query("this AND that");
>> Lucy::TFIDF::Boost(query, Searcher);
>>
>> query = new Custom::Query("este & ese");
>> Custom::Boost(query, Searcher, IP, flags, whatever);
>>
>> query = new One::Stop::Boosted::Query("user input", flags, boost_parameters);
>
> Query objects are often created directly by a user.  We should not modify such
> Queries by overwriting the user-supplied boost with a derived, corpus-weighted
> boost.

Perhaps surprisingly, I agree with this:  the Query object should not
be changed as a side effect of running a search.  But in the examples
I'm giving above, "we" are not changing the Query, the user is.  We're
simply providing tools to let them do so efficiently, and
demonstrating the pattern by which other such tools can be written.
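
For example, in Perl-ish pseudocode -- Lucy::TFIDF::Boost() is a
hypothetical helper, not an existing API, shown only to illustrate the
pattern of an explicit, user-invoked weighting pass:

# (reusing $searcher and $parser from the earlier sketch)
my $query = $parser->parse('this AND that');

# The user asks for corpus-weighted boosts; nothing happens unless they do.
# Lucy::TFIDF::Boost() is hypothetical: it would walk the Query tree and
# scale each node's boost using statistics pulled from $searcher.
Lucy::TFIDF::Boost($query, $searcher);

my $hits = $searcher->hits( query => $query, num_wanted => 10 );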

> Query objects may also be weighted in a PolySearcher and then passed down into
> a child Searcher.  It is essential that the child Searcher know that weighting
> has already been performed and must not be performed again.

I feel this is an architectural flaw, and that the correct solution is
that weighting should never be performed automatically.  It should be
an explicit step that happens under the control of the user, with Lucy
the library providing the tools to do so.  No flags, no checks: just
run the Query as it comes in.

I think the parallel with query optimization is accurate.  Query
optimization is a great thing, but it should not happen behind the
scenes.  It's OK if the default QueryParser does the optimization, but
the engine should run exactly the Query it's passed.  In the same way,
the weighting needs to be independent of the "engine".

Viewing everything as happening on a Child searcher on another
physical machine seems like a good approach.  Assume there is a
machine with a known index schema and a net connection: exactly what
do we need to specify over-the-wire to get the results we want?  This
is the degree of isolation we want when splitting up the phases.

I'm suggesting that we should be able to just serialize the Query and
specify which results we want returned.  Because the corpus statistics
are only known by the parent, to me it makes no sense to do the
weighting on the child: I think you essentially want to take the Query
and the Scorer and combine them into a single entity (Compiler),
whereas I want to keep them distinct.

But rather than discussing the abstract, I think we can focus on the
specific:  what information needs to be sent as part of the search
request for a specific case?  We want to search ["this" AND "that"],
weighted with TF/IDF, returning the top 10 scores.  What bytes form the
Request that we need to send to the child?

Presuming we know the full corpus statistics on the parent, I think we
can just serialize a pre-weighted query, specify the name of a Scorer
(one that adds subqueries), and that we want only the top 10 results.
I don't think the child needs to know whether we are using TF/IDF,
TF/IFC, or BM25.   What am I missing?
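
As a strawman, sketched as a Perl hash before serialization -- the
field names and the scorer label are illustrative, not a proposed wire
format:

# Hypothetical over-the-wire request from parent to child.  The boosts
# are already folded into the Query, so the child needs no corpus stats.
my %request = (
    query      => $frozen_query,            # pre-weighted Query, serialized
    scorer     => 'SumOfSubqueryScores',    # named Scorer to apply
    num_wanted => 10,                       # return only the top 10
);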

Probably lots.  I think I'm presuming that the weighting method can be
independent of the scoring method.   The methods you've mentioned
blend these two, but I think they can be separated.   Maybe they can't
be separated in general:  what if you wanted to specify that words
close to the head of a document should be more valuable?  But
I'm hoping that this can be solved by adding some configuration
options to the Scorer name.
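
Roughly, the request above might just grow an options slot alongside
the Scorer name -- again, every name here is illustrative:

# Hypothetical: same request shape, with configuration for the scorer.
my %request = (
    query       => $frozen_query,
    scorer      => 'ProximityToHead',                # made-up scorer name
    scorer_opts => { window => 100, decay => 0.5 },  # made-up knobs
    num_wanted  => 10,
);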

--nate
