lucy-dev mailing list archives

From: Marvin Humphrey <mar...@rectangular.com>
Subject: [lucy-dev] Exile query weighting code to Siberia
Date: Wed, 13 Apr 2011 19:21:09 GMT

On Tue, Apr 12, 2011 at 12:09:34AM -0700, Nathan Kurz wrote:
> I wasn't trying to say that it shouldn't be weighted, but that the weighting
> should be explicit rather than automatic.   I was suggesting that instead of
> checking whether the weighting has already been done, we provide a means for
> the weighting to be done and simply require it be used.  

Moving weighting out of the library and into application space would increase
the complexity of user code, ipso facto:  

     my $query = $query_parser->parse($query_string);
+    $query = $query->weight(searcher => $searcher);
     my $hits = $searcher->hits(query => $query);
     ...

For TF/IDF, queries should *always* be weighted, so if we made this change the
user would simply become responsible for manually executing a step that Lucy
performs automatically right now.  Understanding query weighting is hard, and
we could expect the user to make errors at least some of the time.  That's
unfortunate, since Lucy currently performs this step correctly 100% of the
time; we should expect both Lucy's average accuracy and user satisfaction to
drop.  
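
To make concrete why the weighting step needs the searcher at all, here is a
minimal sketch in Python (chosen only to keep the example self-contained; it
is not Lucy code).  It assumes a Lucene-style IDF formula, and `ToySearcher`,
`toy_idf` and `weight` are hypothetical names invented for illustration, not
Lucy's API.  The point is just that a term's weight is derived from
collection-wide statistics, so it cannot be computed from the query alone.

```python
import math


def toy_idf(doc_freq, total_docs):
    """Lucene-style inverse document frequency (an assumed formula)."""
    if total_docs == 0:
        return 1.0
    return 1.0 + math.log(total_docs / (1.0 + doc_freq))


class ToySearcher:
    """Hypothetical stand-in for a searcher: it only knows collection stats."""

    def __init__(self, total_docs, doc_freqs):
        self.total_docs = total_docs
        self.doc_freqs = doc_freqs  # term -> number of docs containing it

    def doc_freq(self, term):
        return self.doc_freqs.get(term, 0)


def weight(terms, searcher):
    """Return a per-term weight; the result depends on the searcher's stats."""
    return {
        term: toy_idf(searcher.doc_freq(term), searcher.total_docs)
        for term in terms
    }
```

Under these assumptions a rare term gets a larger weight than a common one,
and the same query weighted against a different collection gets different
numbers, which is the step a user could forget or get wrong if it were manual.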

I think many users would be surprised and confused if we started requiring
them to take charge of query weighting.  Heck, a lot of the people on this dev
list might wonder why on earth we'd consider such a crazy idea. :)

The proposal makes perfect sense, though, if scoring isn't important to you.

What if Lucy were a boolean matching engine, which you could hack to augment
with TF/IDF scores?  What if TF/IDF was an add-on, and all TF/IDF weighting
code lived outside of core?  What if only a tiny fraction of Lucy's users
needed to weight their queries?

If all that were true, Lucy's internals could be simplified considerably.  All
of the weighting code would be gone -- we wouldn't have to think about it in
either single-node or search-cluster context.  Lucy::Search::Compiler would be
gone and we would all just pass around Query objects.  Only the TF/IDF weirdos
would stuff those bizarre calls to $query->weight into their application
code...
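
As a rough illustration of that hypothetical split, the sketch below (again
in Python, and again not Lucy code) keeps the query classes as plain
containers, does boolean matching in one small function, and leaves TF/IDF
scoring as an optional add-on outside the query classes.  The toy index
layout (term -> {doc_id: term_freq}), the sqrt(tf) * idf scoring, and the
function names `match`, `leaf_terms` and `tfidf_scores` are all assumptions
made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TermQuery:
    term: str


@dataclass(frozen=True)
class ANDQuery:
    children: tuple


@dataclass(frozen=True)
class ORQuery:
    children: tuple


def match(query, index):
    """Boolean matching only: return the set of doc ids satisfying the query."""
    if isinstance(query, TermQuery):
        return set(index.get(query.term, {}))
    if not isinstance(query, (ANDQuery, ORQuery)):
        raise TypeError(f"unsupported query type: {type(query).__name__}")
    child_sets = [match(child, index) for child in query.children]
    if not child_sets:
        return set()
    if isinstance(query, ANDQuery):
        return set.intersection(*child_sets)
    return set.union(*child_sets)


def leaf_terms(query):
    """Collect the terms at the leaves of a query tree."""
    if isinstance(query, TermQuery):
        return [query.term]
    return [term for child in query.children for term in leaf_terms(child)]


def tfidf_scores(query, index, total_docs):
    """Optional add-on: score the docs that match() returns.

    Assumes total_docs > 0.
    """
    scores = {doc_id: 0.0 for doc_id in match(query, index)}
    for term in leaf_terms(query):
        postings = index.get(term, {})
        idf = 1.0 + math.log(total_docs / (1.0 + len(postings)))
        for doc_id, term_freq in postings.items():
            if doc_id in scores:
                scores[doc_id] += math.sqrt(term_freq) * idf
    return scores
```

A caller who only needs matching never touches `tfidf_scores`; only code that
wants ranked results calls it, which mirrors the "TF/IDF as add-on" idea.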

> This is just from general desire to make the code paths as simple and
> explicit as they can be.

Imagine Lucy without Lucy::Search::Compiler and all of its subclasses.  

All of a sudden, TermQuery.c, PhraseQuery.c, etc. get much smaller.  Before:

  $ find core/Lucy/Search -print | ack "Query\.c\b" | xargs wc -l
     134 core/Lucy/Search/ANDQuery.c
     109 core/Lucy/Search/LeafQuery.c
      96 core/Lucy/Search/MatchAllQuery.c
     142 core/Lucy/Search/NoMatchQuery.c
     133 core/Lucy/Search/NOTQuery.c
     139 core/Lucy/Search/ORQuery.c
     396 core/Lucy/Search/PhraseQuery.c
     194 core/Lucy/Search/PolyQuery.c
      51 core/Lucy/Search/Query.c
     276 core/Lucy/Search/RangeQuery.c
     138 core/Lucy/Search/RequiredOptionalQuery.c
     258 core/Lucy/Search/TermQuery.c
    2066 total

After:

  $ find core/Lucy/Search -print | ack "Query\.c\b" | xargs wc -l
      72 core/Lucy/Search/ANDQuery.c
     100 core/Lucy/Search/LeafQuery.c
      56 core/Lucy/Search/MatchAllQuery.c
     103 core/Lucy/Search/NoMatchQuery.c
      68 core/Lucy/Search/NOTQuery.c
      69 core/Lucy/Search/ORQuery.c
     131 core/Lucy/Search/PhraseQuery.c
     102 core/Lucy/Search/PolyQuery.c
      51 core/Lucy/Search/Query.c
     158 core/Lucy/Search/RangeQuery.c
      81 core/Lucy/Search/RequiredOptionalQuery.c
      95 core/Lucy/Search/TermQuery.c
    1086 total

Not only do we cut the total line count roughly in half (2066 lines down to
1086), but what's left in those files is basic container code.  If you are
browsing through the Lucy code base trying to understand how everything fits
together -- or trying to implement your own matching framework on top of
those Query classes -- that's going to make things a lot easier.

Of course we cannot simply eliminate all that TF/IDF code -- but we can stuff
it into a dark corner such as Lucy::Score or Lucy::TFIDF.  Very few people are
going to want to mess with it or study it.

> I know this doesn't currently exist, but your MatchEngine and
> Lucy::Score::TFIDF* hierarchy feels like a good direction to explore.

Groovy.  Though I'm not sure where the TF/IDF code will end up yet, I think
simplifying the *Query.c files ought to be one of the goals of this
refactoring round.

Marvin Humphrey

