lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] Automagic phrase search
Date Mon, 22 Aug 2011 00:29:14 GMT
On Sun, Aug 21, 2011 at 09:34:58PM +0200, Moritz Lenz wrote:
> One of my conclusions is that Google & co. usually treat the order of
> search terms as an important indicator for relevance, while most other
> search engines don't.

Yep.  A technique was described in section 4.5.1 of the seminal Brin/Page 1998
paper, "The Anatomy of a Large-Scale Hypertextual Web Search Engine":

    http://infolab.stanford.edu/~backrub/google.html

    For a multi-word search, the situation is more complicated. Now multiple
    hit lists must be scanned through at once so that hits occurring close
    together in a document are weighted higher than hits occurring far apart.
    The hits from the multiple hit lists are matched up so that nearby hits
    are matched together. For every matched set of hits, a proximity is
    computed. The proximity is based on how far apart the hits are in the
    document (or anchor) but is classified into 10 different value "bins"
    ranging from a phrase match to "not even close".  Counts are computed not
    only for every type of hit but for every type and proximity. Every type
    and proximity pair has a type-prox-weight. The counts are converted into
    count-weights and we take the dot product of the count-weights and the
    type-prox-weights to compute an IR score.

Search query strings entered by humans generally carry positional information
and it is a shame to throw it away.
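
To make the quoted scheme concrete, here is a minimal sketch of proximity binning for a two-term query. It is not Google's implementation and not Lucy code. The bin boundaries, the per-bin weights, and the log-based count-weight are all invented for illustration (the paper does not publish them), and the sketch ignores the hit "type" dimension as well as term order.

```perl
use strict;
use warnings;
use List::Util qw( min );

# ASSUMPTION: upper distance limit (in token positions) for bins 0..8.
# Anything farther apart falls into bin 9, i.e. "not even close".
my @BIN_UPPER = ( 1, 2, 3, 5, 8, 13, 21, 34, 55 );

# ASSUMPTION: one hypothetical weight per bin, highest for phrase-like hits.
my @PROX_WEIGHT = ( 10, 8, 6, 5, 4, 3, 2, 1.5, 1, 0 );

# Map a distance between two hits onto one of the 10 bins.
sub proximity_bin {
    my ($distance) = @_;
    for my $bin ( 0 .. $#BIN_UPPER ) {
        return $bin if $distance <= $BIN_UPPER[$bin];
    }
    return scalar @BIN_UPPER;    # bin 9
}

# Score one document from the token positions of two query terms.
sub proximity_score {
    my ( $first_positions, $second_positions ) = @_;
    return 0 unless @$first_positions && @$second_positions;

    # Match each hit of the first term with the nearest hit of the second
    # term and count how many matched pairs land in each bin.
    my @counts = (0) x @PROX_WEIGHT;
    for my $pos (@$first_positions) {
        my $nearest = min( map { abs( $_ - $pos ) } @$second_positions );
        $counts[ proximity_bin($nearest) ]++;
    }

    # ASSUMPTION: log(1 + count) stands in for the paper's count-weights.
    # The score is the dot product of count-weights and bin weights.
    my $score = 0;
    for my $bin ( 0 .. $#PROX_WEIGHT ) {
        $score += log( 1 + $counts[$bin] ) * $PROX_WEIGHT[$bin];
    }
    return $score;
}
```

With these made-up weights, a document where the two terms are adjacent outscores one where they sit 39 positions apart, which is the behavior the excerpt describes.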

> I'd love to have a mechanism in lucy to provide an automagic phrase search
> as above, which honors the order of search words even outside an explicit
> phrase search, and less restrictively than an explicit phrase search.

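You can get something close to that in Lucy by augmenting ordinary searches
with parallel proximity queries.  Documents where the terms occur near each
other match both branches of the top-level OR and so score higher: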

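    use Lucy;
    use LucyX::Search::ProximityQuery;

    # Ordinary term queries: match documents containing either word.
    my $foo_query = Lucy::Search::TermQuery->new(
        field => 'content',
        term  => 'foo',
    );
    my $bar_query = Lucy::Search::TermQuery->new(
        field => 'content',
        term  => 'bar',
    );
    my $foo_or_bar_query = Lucy::Search::ORQuery->new(
        children => [ $foo_query, $bar_query ],
    );

    # Parallel proximity query: matches only when 'foo' and 'bar' occur
    # close together (within => 20).
    my $foo_near_bar_query = LucyX::Search::ProximityQuery->new(
        field  => 'content',
        terms  => [qw( foo bar )],
        within => 20,
    );

    # OR the two together, so proximity adds to the score without
    # restricting the result set.
    my $top_level_query = Lucy::Search::ORQuery->new(
        children => [ $foo_or_bar_query, $foo_near_bar_query ],
    );

    # $searcher is assumed to be an existing Lucy::Search::IndexSearcher.
    my $hits = $searcher->hits( query => $top_level_query );
    ...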

> Is something like that already implemented, and if no, is it on any agenda?

I tried a while back to work this into Lucy at a low level, optimizing both
index data structures and search time object hierarchies to support it.
Ultimately, I pulled that code out because the path I'd chosen wasn't going to
make automatic proximity support feasible without negatively impacting
ordinary searching.

One remnant of that attempt is Lucy::Plan::Architecture.  Part of the
motivation for allowing arbitrary index structures via Architecture
was to facilitate future experimentation with automatic proximity support.

> I know far too little about Lucy's internal workings to know if that's
> easy or even possible, but for me it would be a real killer feature.
> If somebody points me the direction where to start I might even give it
> a try, though my C fu is mediocre at best.

Can you work with Parse::RecDescent?

What would be really useful is a query parser which automatically generates
query structures like the one built up manually above.  

If that interests you, I suggest going through the Lucy tutorial if you
haven't already, as the Lucy::Docs::Tutorial::QueryObjects chapter contains
relevant material, then checking out Lucy::Docs::Cookbook::CustomQueryParser
and Lucy::Docs::Cookbook::CustomQuery.
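
As a starting point, the manual construction above could be wrapped in a helper along these lines. This is only a rough sketch: `tokenize_query` and `build_auto_proximity_query` are hypothetical names, not part of Lucy, and the whitespace tokenizer is a stand-in for running the query string through the field's real analyzer. It also ignores quoting, boolean operators, and field prefixes, all of which a proper QueryParser subclass would have to handle.

```perl
use strict;
use warnings;

# Hypothetical helper: split a raw query string into terms on whitespace.
# ASSUMPTION: real code should use the same analyzer as the indexed field.
sub tokenize_query {
    my ($query_string) = @_;
    return [ grep { length } split /\s+/, $query_string ];
}

# Hypothetical helper: build (term1 OR term2 OR ...) OR (terms near each
# other), mirroring the hand-built query structure shown earlier.
sub build_auto_proximity_query {
    my (%args) = @_;
    my $field  = $args{field};
    my $within = $args{within} // 20;
    my $terms  = tokenize_query( $args{query_string} );
    return unless @$terms;

    # Loaded lazily so that the tokenizer can be used without Lucy installed.
    require Lucy;
    require LucyX::Search::ProximityQuery;

    my @term_queries;
    for my $term (@$terms) {
        push @term_queries,
            Lucy::Search::TermQuery->new( field => $field, term => $term );
    }
    my $any_term_query
        = Lucy::Search::ORQuery->new( children => \@term_queries );

    # A proximity query only makes sense for two or more terms.
    return $any_term_query if @$terms < 2;

    my $near_query = LucyX::Search::ProximityQuery->new(
        field  => $field,
        terms  => $terms,
        within => $within,
    );
    return Lucy::Search::ORQuery->new(
        children => [ $any_term_query, $near_query ],
    );
}
```

The returned object can be passed to `$searcher->hits( query => ... )` in the same way as `$top_level_query` above.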

Cheers,

Marvin Humphrey

