incubator-lucy-user mailing list archives

From goran kent <gorank...@gmail.com>
Subject Re: [lucy-user] Highlighting problem with latest trunk
Date Fri, 11 Nov 2011 07:04:38 GMT
On Fri, Nov 11, 2011 at 12:53 AM, Marvin Humphrey
<marvin@rectangular.com> wrote:
> Correct, that is the most important factor.  There are likely a lot of remote
> doc_freq() calls bouncing around, and those calls are being executed serially
> by PolySearcher.
>
> Highlighter requires a weighted query -- in Lucy-speak, a
> Lucy::Search::Compiler[1] -- in order to determine both which parts of the
> field matched and how much they contributed to the score.  In order to weight
> a query, you need to know how common each term is, so that in a search for
> 'the metamorphosis' the term 'metamorphosis' contributes more to the score
> than the term 'the'.  In the context of Highlighter, we need to know that
> 'metamorphosis' is more important than 'the' so that we can prefer
> selections which contain 'metamorphosis' over selections which contain 'the'
> when choosing an excerpt.
>
> However, having Highlighter weight the query means duplicated work, because
> the main Searcher has to perform exactly the same weighting routine prior to
> asking the remote nodes to score results.
>
> We can eliminate the duplicated effort by performing the weighting manually
> and supplying the weighted query object to both Searcher#hits and
> Highlighter#new.  (You'll need the latest trunk because this sample code
> requires a patch from LUCY-188 I committed earlier today.)
>
>    my $query    = $query_parser->parse($query_string);
>    my $compiler = $query->make_compiler(searcher => $searcher);
>    my $hits     = $searcher->hits(query => $compiler);
>    my $highlighter = Lucy::Highlight::Highlighter->new(
>        query    => $compiler,
>        searcher => $searcher,
>        field    => 'content',
>    );
>
> You may be able to cut down those remote doc_freq() calls further by using a
> QueryParser which has the minimum possible number of fields.  I can go into
> depth on that in another email if you like.
>
> Marvin Humphrey
>
> [1] It's called a "Compiler" because its primary role is to compile a Query
>    to a Matcher.  Nobody likes the name, but we haven't achieved consensus on
>    what to do about it.
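
To make the weighting idea above concrete, here is a minimal sketch in plain Perl (no Lucy required) of why 'metamorphosis' ends up outweighing 'the'. The collection size and document frequencies are made-up numbers, and the formula is a common inverse-document-frequency variant; Lucy's own Similarity class may compute its weights differently.

```perl
use strict;
use warnings;

# Hypothetical numbers, for illustration only: a 1000-document collection
# in which 'the' appears in 990 documents and 'metamorphosis' in 3.
my $doc_max  = 1000;
my %doc_freq = ( the => 990, metamorphosis => 3 );

# A common inverse-document-frequency formula (an assumption here, not
# necessarily what Lucy uses): the rarer the term, the larger the weight.
sub idf {
    my ( $freq, $total_docs ) = @_;
    return 1 + log( $total_docs / ( 1 + $freq ) );
}

# Weight each query term by its rarity across the collection.
my %weight = map { $_ => idf( $doc_freq{$_}, $doc_max ) } keys %doc_freq;

printf "%-14s %.3f\n", $_, $weight{$_} for sort keys %weight;
```

In a multi-node setup, each document frequency above would presumably come from a doc_freq() call per term (and per field) against the remote nodes, which is why weighting the query once and reusing the result saves round trips.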

Crikey, I'm so accustomed to the oftentimes terse replies from FOSS
devs that receiving such detailed and obviously
time-consuming-to-compile replies leaves me staring dumbly into the
headlights; I may have dribbled a bit there.

OK, back to Lucy: that would explain the herds of mysterious
doc_freq() calls I was seeing in my remote debug prints.  At the time, it
was like, "what the hell?"  The dim light of comprehension blinks on,
well, dimly.

Let me chew on this stuff a bit and I'll report back with some results.
