incubator-lucy-user mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-user] Highlighting problem with latest trunk
Date Thu, 10 Nov 2011 22:53:19 GMT
On Thu, Nov 10, 2011 at 11:31:31AM +0200, goran kent wrote:
> On 11/10/11, goran kent <gorankent@gmail.com> wrote:
> > Completing each new() requires 4s *each*.  Somehow I don't recall this
> > being the case before :/
> 
> The 4s x 2 penalty is probably related to the remote searching (of
> which highlighting is a part) doing things serially and not
> concurrently, and not this particular bug, no?

Correct, that is the most important factor.  There are likely a lot of remote
doc_freq() calls bouncing around, and those calls are being executed serially
by PolySearcher.

Highlighter requires a weighted query -- in Lucy-speak, a
Lucy::Search::Compiler[1] -- in order to determine both which parts of the
field matched and how much they contributed to the score.  In order to weight
a query, you need to know how common each term is, so that in a search for
'the metamorphosis' the term 'metamorphosis' contributes more to the score
than the term 'the'.  In the context of Highlighter, we need to know that
'metamorphosis' is more important than 'the' so that we can prefer
selections which contain 'metamorphosis' over selections which contain 'the'
when choosing an excerpt.
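
To make the weighting concrete, here is a minimal sketch of inverse document frequency in plain Perl. The formula is a classic IDF variant assumed for illustration, the document counts are made up, and idf() is a hypothetical helper rather than part of Lucy's API. Lucy's own Similarity class may compute weights differently.

```perl
use strict;
use warnings;

# Hypothetical helper: a classic inverse-document-frequency formula.
# Rarer terms (low doc_freq) receive a larger weight.
sub idf {
    my ( $doc_freq, $total_docs ) = @_;
    return 1 if $total_docs == 0;
    return 1 + log( $total_docs / ( 1 + $doc_freq ) );
}

# Made-up statistics for a collection of 1,000,000 documents.
my $total_docs = 1_000_000;
my %doc_freq   = (
    the           => 950_000,
    metamorphosis => 120,
);

for my $term ( sort keys %doc_freq ) {
    printf "%-14s idf = %.2f\n", $term, idf( $doc_freq{$term}, $total_docs );
}
```

Under these assumed numbers 'metamorphosis' gets a far larger weight than 'the', which is the kind of signal an excerpt selector can use to rank candidate selections.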

However, having Highlighter weight the query means duplicated work, because
the main Searcher has to perform exactly the same weighting routine prior to
asking the remote nodes to score results.

We can eliminate the duplicated effort by performing the weighting manually
and supplying the weighted query object to both Searcher#hits and
Highlighter#new.  (You'll need the latest trunk because this sample code
requires a patch from LUCY-188 I committed earlier today.)

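    # Assumes $query_parser, $searcher and $query_string already exist.
    # Parse the query string, then weight it once up front.
    my $query    = $query_parser->parse($query_string);
    my $compiler = $query->make_compiler(searcher => $searcher);

    # Reuse the same weighted query for both searching and highlighting.
    my $hits        = $searcher->hits(query => $compiler);
    my $highlighter = Lucy::Highlight::Highlighter->new(
        query    => $compiler,
        searcher => $searcher,
        field    => 'content',
    );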

You may be able to cut down those remote doc_freq() calls further by using a
QueryParser which has the minimum possible number of fields.  I can go into
depth on that in another email if you like.
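
As a rough illustration of why the field count matters: if the parser expands every query term across every field it knows about, each term/field pair would need its own doc_freq() lookup on each remote node. The back-of-envelope sketch below assumes exactly that model. The function name and the numbers are hypothetical, not measurements from a real cluster.

```perl
use strict;
use warnings;

# Hypothetical estimate: one doc_freq() round trip per (term, field, node).
sub estimated_doc_freq_calls {
    my (%args) = @_;
    return $args{terms} * $args{fields} * $args{nodes};
}

# Two query terms against four remote nodes, with six fields vs. one field.
my $wide   = estimated_doc_freq_calls( terms => 2, fields => 6, nodes => 4 );
my $narrow = estimated_doc_freq_calls( terms => 2, fields => 1, nodes => 4 );

print "six fields: $wide calls; one field: $narrow calls\n";
```

Because the calls are executed serially, any reduction in their number should translate fairly directly into less waiting.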

Marvin Humphrey

[1] It's called a "Compiler" because its primary role is to compile a Query
    to a Matcher.  Nobody likes the name, but we haven't achieved consensus on
    what to do about it.

