Date: Fri, 11 Nov 2011 09:04:38 +0200
From: goran kent
To: lucy-user@incubator.apache.org
Subject: Re: [lucy-user] Highlighting problem with latest trunk

On Fri, Nov 11, 2011 at 12:53 AM, Marvin Humphrey wrote:
> Correct, that is the most important factor.  There are likely a lot of remote
> doc_freq() calls bouncing around, and those calls are being executed serially
> by PolySearcher.
>
> Highlighter requires a weighted query -- in Lucy-speak, a
> Lucy::Search::Compiler[1] -- in order to determine both which parts of the
> field matched and how much they contributed to the score.  In order to weight
> a query, you need to know how common each term is, so that in a search for
> 'the metamorphosis' the term 'metamorphosis' contributes more to the score
> than the term 'the'.  In the context of Highlighter, we need to know that
> 'metamorphosis' is more important than 'the' so that we can prefer
> selections which contain 'metamorphosis' over selections which contain 'the'
> when choosing an excerpt.
>
> However, having Highlighter weight the query means duplicated work, because
> the main Searcher has to perform exactly the same weighting routine prior to
> asking the remote nodes to score results.
>
> We can eliminate the duplicated effort by performing the weighting manually
> and supplying the weighted query object to both Searcher#hits and
> Highlighter#new.  (You'll need the latest trunk because this sample code
> requires a patch from LUCY-188 I committed earlier today.)
>
>    my $query    = $query_parser->parse($query_string);
>    my $compiler = $query->make_compiler(searcher => $searcher);
>    my $hits     = $searcher->hits(query => $compiler);
>    my $highlighter = Lucy::Highlight::Highlighter->new(
>        query    => $compiler,
>        searcher => $searcher,
>        field    => 'content',
>    );
>
> You may be able to cut down those remote doc_freq() calls further by using a
> QueryParser which has the minimum possible number of fields.  I can go into
> depth on that in another email if you like.
>
> Marvin Humphrey
>
> [1] It's called a "Compiler" because its primary role is to compile a Query
>    to a Matcher.  Nobody likes the name, but we haven't achieved consensus on
>    what to do about it.

Crikey, I'm so accustomed to the oftentimes terse replies from FOSS devs
that receiving such detailed and obviously time-consuming-to-compile
replies leaves me staring at the headlights dumbly; and I dribbled a bit
there.

OK, back to Lucy - that would explain the herds of mysterious doc_freq()
calls I was seeing in my remote debug prints.  At the time, it was like,
"what the hell?"  The dim light of comprehension blinks on, well, dimly.

Let me chew on this stuff a bit and I'll report back with some results.
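
For anyone following the thread, here is a rough sketch of why weighting needs a document frequency for every term, and hence why each term can trigger a doc_freq() lookup. It is plain Perl and does not use Lucy. The corpus numbers are made up, and the idf() helper is a generic textbook inverse-document-frequency formula assumed for illustration, not necessarily what Lucy's Similarity class computes.

```perl
use strict;
use warnings;

# Hypothetical corpus statistics, invented for illustration.  In a
# clustered setup, each of these counts is the kind of number a
# doc_freq() call would have to fetch.
my $total_docs = 1000;
my %doc_freq   = (
    the           => 990,    # appears in nearly every document
    metamorphosis => 12,     # appears in only a handful
);

# Generic textbook IDF (an assumption, not Lucy's documented formula):
# the rarer the term, the larger its weight.
sub idf {
    my ( $df, $n ) = @_;
    return 1 + log( $n / ( 1 + $df ) );
}

# Weight every query term from its document frequency.
my %weight = map { $_ => idf( $doc_freq{$_}, $total_docs ) } keys %doc_freq;

# Print terms from most to least important.
for my $term ( sort { $weight{$b} <=> $weight{$a} } keys %weight ) {
    printf "%-14s doc_freq=%4d  weight=%.3f\n",
        $term, $doc_freq{$term}, $weight{$term};
}
```

With these made-up numbers 'metamorphosis' comes out well ahead of 'the', which is the preference Highlighter relies on when choosing an excerpt.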