incubator-lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] which fields contained which terms
Date Tue, 30 Aug 2011 21:59:06 GMT
On Tue, Aug 30, 2011 at 01:38:39PM -0500, Peter Karman wrote:
> Per the thread here from Feb 2011[0] I want to make it easy to discover why a
> document matched a given query, i.e. which terms matched in which fields.
> 
> Marvin and I have chatted about this a few different times on #lucy_dev, and
> it's clear to me now why it is problematic to do this kind of data gathering in
> the existing Matcher/Collector architecture. Post-processing provides a cleaner
> way into the solution, provided we can do it without sacrificing performance.
> 
> I wanted to get this thread on to the -dev list as we need to sort out if/how
> the index structure might change to make this feature possible.

The general idea is to treat this problem like highlighting, which is also
done using post-processing.

To support highlighting, at index-time we create an inverted representation
for each field that has been marked as "highlightable", then serialize all the
inverted fields together in one blob (called, for no particularly good reason,
a "DocVector").  Effectively this is a miniature inverted-index containing a
single document.  The class which does the work is
Lucy::Index::HighlightWriter, and the relevant segment files are named
seg_NNN/highlight.ix and seg_NNN/highlight.dat.
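
For illustration, marking fields as highlightable at schema-definition time
might look roughly like the sketch below.  The field names 'title' and 'body'
and the choice of an English PolyAnalyzer are assumptions made for the example,
not something taken from Peter's application.

```perl
use strict;
use warnings;
use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Analysis::PolyAnalyzer;

# Assumed analyzer; any Analyzer would do for this sketch.
my $analyzer = Lucy::Analysis::PolyAnalyzer->new( language => 'en' );

# highlightable => 1 asks the indexer to store the per-document
# inverted data (the "DocVector") for fields of this type.
my $type = Lucy::Plan::FullTextType->new(
    analyzer      => $analyzer,
    highlightable => 1,
);

my $schema = Lucy::Plan::Schema->new;

# Hypothetical field names, for illustration only.
$schema->spec_field( name => 'title', type => $type );
$schema->spec_field( name => 'body',  type => $type );
```

Storing the extra data for every field makes the index larger, so in practice
you would probably mark only the fields you need to inspect.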

At search time, we retrieve the single-document mini-inverted-index which
corresponds to each hit, and then use that information to determine what
portions of a given highlightable field matched.  Each Query subclass's
companion Compiler class implements a Highlight_Spans() method which returns
an array of Lucy::Search::Span objects.  If the field matched against the
document, the array returned by Highlight_Spans() will be non-empty, and
Highlighter uses those spans to choose the excerpt and highlight the relevant
sections.

Hey wait a minute...

It occurs to me that we might be able to fake up a prototype implementation
using the existing Highlight_Spans() functionality.  

Make sure that you spec every field as "highlightable".  Then, at search time,
try something like this:

    my $query    = $query_parser->parse($query_string);
    my $compiler = $query->make_compiler(
        searcher => $searcher,
        boost    => $query->get_boost,
    );
    my $schema = $searcher->get_schema;
    my $hits   = $searcher->hits( query => $query );
    while ( my $hit = $hits->next ) {
        # Fetch the single-document mini-inverted-index for this hit.
        my $doc_vec = $searcher->fetch_doc_vec( $hit->get_doc_id );
        my @relevant_fields;
        for my $field ( @{ $schema->all_fields } ) {
            my $spans = $compiler->highlight_spans(
                searcher => $searcher,
                doc_vec  => $doc_vec,
                field    => $field,
            );
            # A non-empty span array means the query matched in this field.
            if (@$spans) {
                push @relevant_fields, $field;
            }
        }
        print "Relevant fields: ";
        print join ", ", @relevant_fields;
        print "\n";
    }

If a field produces highlight spans, it was relevant.  If it doesn't produce
highlight spans, it wasn't relevant.

Does that work?

Marvin Humphrey

