From: Nathan Kurz <nate@verse.com>
Date: Wed, 13 Apr 2011 21:01:51 -0700
Message-ID: <BANLkTikr0NQFg9bDe812hF1GN4_AH0aLWQ@mail.gmail.com>
To: lucy-dev@incubator.apache.org
Subject: Re: [lucy-dev] Exile query weighting code to Siberia

On Wed, Apr 13, 2011 at 12:21 PM, Marvin Humphrey
<marvin@rectangular.com> wrote:
> Moving weighting out of the library and into application space would increase
> the complexity of user code, ipso facto:
>
>     my $query = $query_parser->parse($query_string);
> +   $query = $query->weight(searcher => $searcher);
>     my $hits = $searcher->hits(query => $query);

It doesn't have to be in the application level --- I'd be perfectly
happy to have it happen in the query parser, so long as the query
parser was clearly written and self-contained, so that one could
confidently rewrite it to use a different weighting scheme without
full knowledge of everything that happens afterward.  Gravy would be
if that query parser was contained in a subclass specific to TFIDF:

my $query = new Lucy::TFIDF::Query($query_string);
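
To make that a bit more concrete, here is a rough sketch of the shape I
have in mind.  It is in Python purely for brevity and is not Lucy's API:
the class names, the weight() hook, and the precomputed idf_by_term table
are all invented for illustration.  The point is only that the weighting
lives in one overridable method, so swapping schemes does not require
knowing what happens downstream.

```python
# Hypothetical sketch: none of these class names are Lucy APIs.
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightedTerm:
    term: str
    weight: float


class QueryParser:
    """Splits a query string into terms and gives each the same weight."""

    def parse(self, query_string):
        terms = query_string.lower().split()
        return [WeightedTerm(t, self.weight(t)) for t in terms]

    def weight(self, term):
        # Constant weighting: every term counts equally.
        return 1.0


class TFIDFQueryParser(QueryParser):
    """Same parsing; only the weighting hook is overridden."""

    def __init__(self, idf_by_term):
        # idf_by_term: assumed precomputed mapping of term -> IDF value.
        self.idf_by_term = idf_by_term

    def weight(self, term):
        # Terms missing from the table fall back to the constant weight.
        return self.idf_by_term.get(term, 1.0)
```

Someone wanting a different scheme would subclass QueryParser and
override weight(), leaving parse() alone.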

> For TF/IDF, queries should *always* be weighted, so if we made this change the
> user would simply become responsible for manually executing a step that Lucy
> performs automatically right now.

Sure, but so long as the rules are clear it isn't that onerous.  The
reality is that most new users are going to cut and paste from your
sample program, and so long as the sample includes this line they are
unlikely to go out of their way to remove it.

> I think many users would be surprised and confused if we started requiring
> them to take charge of query weighting.

I might even have to start counting on my toes! ;)

> The proposal makes perfect sense, though, if scoring isn't important to you.

Or if scoring is very important to you.  It makes less sense if what
you want is an out-of-the-box, no-configuration search box for your
text-based web site.

> What if Lucy was a boolean matching engine, which you could hack to augment
> with TF/IDF scores?  What if TF/IDF was an add-on, and all TF/IDF weighting
> code lived outside of core?  What if only a tiny fraction of Lucy's users
> needed to weight their queries?

There's of course the question about what Core means here.  I think
TF/IDF should certainly be part of the core distribution, but it would
be great if it could be compartmentalized.

> If all that were true, Lucy's internals could be simplified considerably.  All
> of the weighting code would be gone -- we wouldn't have to think about it in
> either single-node or search-cluster context.  Lucy::Search::Compiler would be
> gone and we would all just pass around Query objects.  Only the TF/IDF weirdos
> would stuff those bizarre calls to $query->weight into their application
> code...

I can't quite tell how much I'm being mocked here.  I'm guessing you're
trying your best to express a point of view that you don't quite
share.  No offense in either case, though, as I'm sure many things I
suggest are quite deserving of considerable mockery.

Everyone needs their queries to be weighted in some way, even if that
weighting is constant.  TF/IDF is a fine and venerable default
weighting if you happen to be indexing books, blog posts, or
magazine articles.  But if you are indexing something like names,
titles, or lists of properties, inverse document frequency doesn't have
the same resonance.
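
As a toy illustration of the difference, here are the two schemes side by
side.  This is Python rather than anything from Lucy; the log(N / df)
formula is just the plain textbook form of IDF (not necessarily what Lucy
computes), and the collection size and document frequencies are made-up
numbers.

```python
import math


def idf_weight(doc_count, doc_freq):
    # Textbook inverse document frequency; assumes doc_freq >= 1.
    return math.log(doc_count / doc_freq)


def constant_weight(doc_count, doc_freq):
    # Ignores collection statistics entirely.
    return 1.0


# Hypothetical collection of 1000 name records.
doc_count = 1000
doc_freqs = {"smith": 100, "kurz": 1}

for scheme in (idf_weight, constant_weight):
    weights = {
        term: round(scheme(doc_count, df), 2)
        for term, df in doc_freqs.items()
    }
    print(scheme.__name__, weights)
```

With these numbers, IDF weights the rare surname three times as heavily
as the common one, while the constant scheme treats them alike.  Whether
that skew is what you want depends on what you are indexing.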

And although it may be largely semantic, I really do like the idea of
passing around a Query rather than a Compiler.  Especially if we could
keep the Query as simply a canonical representation of a search
request, and split all the other duties off into their own
well-contained classes.

> If you are browsing through the Lucy code base trying to
> understand how everything fits together -- or trying to implement your own
> matching framework on top of those Query classes -- that's going to make
> things a lot easier.

I do think that simplifying the structure would go a long way in
making modifications more accessible.  Compiler really feels like a
catch-all, and yet it's not even in its own hierarchy.  Pop quiz:
how many people on this list know that the code for the TF/IDF-specific
TermCompiler can be found in the file TermQuery.c?  And how
many of those think it belongs there?

>> I know this doesn't currently exist, but your MatchEngine and
>> Lucy::Score::TFIDF* hierarchy feels like a good direction to explore.
>
> Groovy.  Though I'm not sure where the TF/IDF code will end up yet, I think
> simplifying the *Query.c files ought to be one of the goals of this
> refactoring round.

Sounds like a great goal to me!

--nate