From: Jake Mannix
To: java-user@lucene.apache.org
Date: Fri, 13 Nov 2009 14:16:13 -0800
Subject: Re: Term Boost Threshold
Message-ID: <4b124c310911131416q1ddbe7a5y2e3372ec4d2601e3@mail.gmail.com>
In-Reply-To: <3836ec640911131409p6c0fc26bs9b77429889da55ec@mail.gmail.com>

Hi Max,

You want a query like

  ("San Francisco" OR "California") AND ("John Smith" OR "John Smith Manufacturing")

essentially? You can give Lucene exactly this query and it will require that either "John Smith" or "John Smith Manufacturing" be present, but it will score results that have these and one or more of San Fran or CA higher. In fact, it will score highest the results that match all terms.

Does that help?

-jake

On Fri, Nov 13, 2009 at 2:09 PM, Max Lynch wrote:

> Hi,
> I am trying to move from a system where I counted the frequency of terms by
> hand in a highlighter to determine if a result was useful to me. In an
> earlier post on this list someone suggested I could boost the terms that are
> useful to me and only accept hits above a certain threshold. However, in my
> tests, I can't seem to find a deterministic way of calculating a threshold.
>
> Here is an example of what I mean:
> My query: "John Smith" "John Smith Manufacturing" "San Francisco" "California"
>
> Results are only useful to me if they contain the first term "John Smith"
> and/or the second term "John Smith Manufacturing", or any combination with
> the other San Fran and California terms. However, results with just "San
> Francisco" or "California" can be ignored.
>
> I tried something like
> "John Smith"^200 "John Smith Manufacturing"^100 "San Francisco"^2 "California"^1
>
> But I can't seem to find a good method of calculating a cut-off score and
> filtering out the results that are only San Fran or California using the
> term boosting and resulting score. I also don't care about frequency,
> meaning that I want the result even if John Smith occurs once, and I don't
> want a document with "San Francisco" a million times to score higher than
> the single result for John Smith.
>
> Sorry if that's confusing.
>
> Any ideas?
>
> Thanks,
> Max
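
The behavior described in the reply (one of the name phrases must be present, and the location phrases only raise the ranking) can be illustrated with a small toy model. One caveat, as far as I understand Lucene's classic query syntax: joining the two groups with AND makes a location match mandatory too, so the described behavior more likely corresponds to marking only the name group as required, for example

  +("John Smith" "John Smith Manufacturing") "San Francisco" "California"

The sketch below is plain Java, not Lucene code. The class name RequiredClauseSketch, its methods, and the "count of distinct phrases" score are all hypothetical stand-ins chosen for illustration; real Lucene scoring is different. It only shows why a required clause removes the need for a score threshold and why term frequency need not matter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Toy model of required vs. optional phrase clauses. Not Lucene code. */
public class RequiredClauseSketch {

    // Phrases from the thread. At least one REQUIRED phrase must occur.
    static final List<String> REQUIRED =
            Arrays.asList("john smith", "john smith manufacturing");

    // OPTIONAL phrases only raise the rank of a document that already matched.
    static final List<String> OPTIONAL =
            Arrays.asList("san francisco", "california");

    /**
     * Counts how many of the given phrases occur at least once in the text.
     * Uses naive case-insensitive substring matching (an assumption made to
     * keep the sketch short) and ignores how often each phrase occurs.
     */
    public static int countPresent(String text, List<String> phrases) {
        String lower = text.toLowerCase(Locale.ROOT);
        int present = 0;
        for (String phrase : phrases) {
            if (lower.contains(phrase)) {
                present++;
            }
        }
        return present;
    }

    /**
     * Returns 0 when no required phrase is present (the document is rejected),
     * otherwise the number of distinct required and optional phrases matched.
     */
    public static int score(String text) {
        int required = countPresent(text, REQUIRED);
        if (required == 0) {
            return 0; // location-only documents never reach the result list
        }
        return required + countPresent(text, OPTIONAL);
    }

    public static void main(String[] args) {
        System.out.println(score("San Francisco, San Francisco, California"));
        System.out.println(score("A single mention of John Smith"));
        System.out.println(score("John Smith Manufacturing, San Francisco, California"));
    }
}
```

Because rejection is decided by the required group rather than by a cut-off score, repeating "San Francisco" any number of times cannot push a location-only document into the results in this model.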