MIME-Version: 1.0
In-Reply-To: <3836ec640911131409p6c0fc26bs9b77429889da55ec@mail.gmail.com>
References: <3836ec640911131409p6c0fc26bs9b77429889da55ec@mail.gmail.com>
Date: Fri, 13 Nov 2009 14:16:13 -0800
Message-ID: <4b124c310911131416q1ddbe7a5y2e3372ec4d2601e3@mail.gmail.com>
Subject: Re: Term Boost Threshold
From: Jake Mannix <jake.mannix@gmail.com>
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=00504502ae8e0682b1047848023f

--00504502ae8e0682b1047848023f
Content-Type: text/plain; charset=ISO-8859-1

Hi Max,

  You want a query like

+("John Smith" OR "John Smith Manufacturing") "San Francisco" "California"

  essentially?  You can give Lucene exactly this query: the leading + requires
that either "John Smith" or "John Smith Manufacturing" be present, while the
two location phrases stay optional.  (Joining the two groups with AND instead
would also require one of San Fran or CA, which you don't want.)  Results that
match a name and one or more of San Fran or CA will score higher, and results
that match all of the terms will generally score highest.
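
  To make the required-versus-optional idea concrete, here is a rough sketch
in plain Java.  It is not Lucene code and uses no Lucene API: the class name
RequiredClauseSketch, its methods, and the one-point-per-phrase score are all
made up for illustration, and the substring matching is a simplification of
real phrase matching.  It only shows that acceptance is decided by the name
phrases (what Lucene would call a MUST clause), the locations only add to the
score (SHOULD clauses), and no score cut-off is needed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Plain-Java model of required vs. optional clauses; hypothetical, not Lucene API. */
final class RequiredClauseSketch {

    // At least one of these name phrases must be present for a hit to count.
    static final List<String> REQUIRED_ANY =
            Arrays.asList("john smith", "john smith manufacturing");

    // Location phrases never decide acceptance; they only raise the score.
    static final List<String> OPTIONAL =
            Arrays.asList("san francisco", "california");

    /** Counts how many phrases occur at least once (case-insensitive substring match). */
    static int countPresent(String text, List<String> phrases) {
        String lower = text.toLowerCase(Locale.ROOT);
        int present = 0;
        for (String phrase : phrases) {
            if (lower.contains(phrase)) {
                present++; // presence only: repeated occurrences add nothing
            }
        }
        return present;
    }

    /** A document is accepted when any required phrase occurs; no threshold involved. */
    static boolean accepts(String text) {
        return countPresent(text, REQUIRED_ANY) > 0;
    }

    /** Toy score: one point per distinct matching phrase, 0 for rejected documents. */
    static int score(String text) {
        if (!accepts(text)) {
            return 0;
        }
        return countPresent(text, REQUIRED_ANY) + countPresent(text, OPTIONAL);
    }

    public static void main(String[] args) {
        String[] docs = {
            "John Smith opened an office.",
            "John Smith Manufacturing, San Francisco, California",
            "San Francisco San Francisco San Francisco"
        };
        for (String doc : docs) {
            System.out.println(accepts(doc) + " " + score(doc) + " : " + doc);
        }
    }
}
```

  Because the toy score counts each phrase at most once, a document that
repeats "San Francisco" many times gains nothing from the repetition, and it
is rejected anyway when no name phrase is present.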

  Does that help?

  -jake

On Fri, Nov 13, 2009 at 2:09 PM, Max Lynch <ihasmax@gmail.com> wrote:

> Hi,
> I am trying to move from a system where I counted the frequency of terms by
> hand in a highlighter to determine if a result was useful to me.  In an
> earlier post on this list someone suggested I could boost the terms that are
> useful to me and only accept hits above a certain threshold.  However, in my
> tests, I can't seem to find a deterministic way of calculating a threshold.
>
> Here is an example of what I mean:
> My query: "John Smith" "John Smith Manufacturing" "San Francisco"
> "California"
>
> Results are only useful to me if they contain the first term "John Smith"
> and/or the second term "John Smith Manufacturing" or any combination with
> the other San Fran and California terms.  However, results with just "San
> Francisco" or "California" can be ignored.
>
> I tried something like "John Smith"^200 "John Smith Manufacturing"^100 "San
> Francisco"^2 "California"^1
>
> But I can't seem to find a good method of calculating a cut-off score and
> filtering out the results that are only San Fran or California using the
> term boosting and resulting score.  I also don't care about frequency,
> meaning that I want the result even if John Smith occurs once, and I don't
> want a document with "San Francisco" a million times to score higher than
> the single result for John Smith.
>
> Sorry if that's confusing.
>
> Any ideas?
>
> Thanks,
> Max
>

--00504502ae8e0682b1047848023f--