Date: Thu, 10 Nov 2011 14:53:19 -0800
From: Marvin Humphrey
To: lucy-user@incubator.apache.org
Message-ID: <20111110225319.GA25189@rectangular.com>
References: <20111109160921.GA8356@rectangular.com>
Subject: Re: [lucy-user] Highlighting problem with latest trunk

On Thu, Nov 10, 2011 at 11:31:31AM +0200, goran kent wrote:
> On 11/10/11, goran kent wrote:
> > Completing each new() requires 4s *each*.
> > Somehow I don't recall this being the case before :/
>
> The 4s x 2 penalty is probably related to the remote searching (of
> which highlighting is a part) doing things serially and not
> concurrently, and not this particular bug, no?

Correct, that is the most important factor. There are likely a lot of
remote doc_freq() calls bouncing around, and those calls are being executed
serially by PolySearcher.

Highlighter requires a weighted query -- in Lucy-speak, a
Lucy::Search::Compiler[1] -- in order to determine both which parts of the
field matched and how much they contributed to the score.

In order to weight a query, you need to know how common each term is, so
that in a search for 'the metamorphosis' the term 'metamorphosis'
contributes more to the score than the term 'the'. In the context of
Highlighter, we need to know that 'metamorphosis' is more important than
'the' so that we can prefer selections which contain 'metamorphosis' over
selections which contain 'the' when choosing an excerpt.

However, having Highlighter weight the query means duplicated work, because
the main Searcher has to perform exactly the same weighting routine prior
to asking the remote nodes to score results.

We can eliminate the duplicated effort by performing the weighting manually
and supplying the weighted query object to both Searcher#hits and
Highlighter#new. (You'll need the latest trunk because this sample code
requires a patch from LUCY-188 I committed earlier today.)

    my $query    = $query_parser->parse($query_string);
    my $compiler = $query->make_compiler(searcher => $searcher);
    my $hits     = $searcher->hits(query => $compiler);
    my $highlighter = Lucy::Highlight::Highlighter->new(
        query    => $compiler,
        searcher => $searcher,
        field    => 'content',
    );

You may be able to cut down those remote doc_freq() calls further by using
a QueryParser which has the minimum possible number of fields. I can go
into depth on that in another email if you like.
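To make the weighting argument concrete, here is a minimal sketch in plain
Perl which does not use Lucy at all. The document counts are invented for
illustration, and idf() is a hypothetical helper implementing a
textbook-style inverse document frequency. It is an assumption that this
resembles what Lucy computes; it is not meant as Lucy's actual formula.

```perl
use strict;
use warnings;

# Hypothetical collection statistics, invented for illustration only.
my $total_docs = 1000;
my %doc_freq   = (
    the           => 990,    # appears in nearly every document
    metamorphosis => 12,     # appears in only a handful
);

# Textbook-style inverse document frequency. Assumption: a generic form,
# not necessarily the formula Lucy applies when weighting a query.
sub idf {
    my ( $doc_freq, $total_docs ) = @_;
    return 1 + log( $total_docs / ( 1 + $doc_freq ) );
}

# Weight each query term by how rare it is in the collection.
my %weight = map { $_ => idf( $doc_freq{$_}, $total_docs ) } keys %doc_freq;

# Print terms from most to least important.
for my $term ( sort { $weight{$b} <=> $weight{$a} } keys %weight ) {
    printf "%-14s %.3f\n", $term, $weight{$term};
}
```

With these made-up counts the rare term 'metamorphosis' ends up with a much
larger weight than 'the', which is the property both the scorer and the
excerpt selection rely on. In a remote setup, each document frequency
lookup of this kind is what has to be fetched from the nodes.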
Marvin Humphrey

[1] It's called a "Compiler" because its primary role is to compile a
    Query to a Matcher. Nobody likes the name, but we haven't achieved
    consensus on what to do about it.