lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Peter Keegan" <peterlkee...@gmail.com>
Subject Re: BoostingTermQuery scoring
Date Fri, 07 Nov 2008 18:32:24 GMT
 boost:(+petroleum +engineer +refinery)
 (+contents:(+petroleum +engineer +refinery)
  +((*:* -boost:petroleum)
    (*:* -boost:engineer)
    (*:* -boost:refinery)))

That's an interesting solution. Would this result in many more documents
being visited by the scorer, possibly impacting performance? (I haven't
tried it yet).

Thanks,
Peter



On Thu, Nov 6, 2008 at 6:56 PM, Steven A Rowe <sarowe@syr.edu> wrote:

> Hi Peter,
>
> On 11/06/2008 at 4:25 PM, Peter Keegan wrote:
> > I've discovered another flaw in using this technique:
> >
> > (+contents:petroleum +contents:engineer +contents:refinery)
> > (+boost:petroleum +boost:engineer +boost:refinery)
> >
> > It's possible that the first clause will produce a matching
> > doc and none of the terms in the second clause are used to
> > score that doc. Yet another reason to use BoostingTermQuery.
>
> I think you could address this, without BTQ, using something like:
>
>  boost:(+petroleum +engineer +refinery)
>  (+contents:(+petroleum +engineer +refinery)
>   +((*:* -boost:petroleum)
>     (*:* -boost:engineer)
>     (*:* -boost:refinery)))
>
> The last three lines gives you the set of documents that are missing at
> least one of the terms in the "boost" field.  The *:* thingy, indicating a
> MatchAllDocsQuery, is necessary to get all documents that don't have a given
> term; Lucene's (sub-)query document exclusion operation needs a non-empty
> set on which to operate.
>
> On 11/06/2008 at 1:08 PM, Peter Keegan wrote:
> > Then, at search time, a query for "petroleum engineer" gets rewritten
> > to: (+contents:petroleum +contents:engineer) (+boost:petroleum
> > +boost:engineer). Note that the two clauses are OR'd so that a term that
> > exists in both fields will get a higher weight in the 'boost' field.
> > This works quite well at boosting documents with terms that exist in the
> > boosted fields. However, it doesn't work properly if excluded terms are
> > added, for example:
> >
> > (+contents:petroleum +contents:engineer -contents:drilling)
> > (+boost:petroleum +boost:engineer -boost:drilling)
> >
> > If a document contains the term 'drilling' in the 'body'
> > field, but not in the 'title' or 'city' field, a false hit occurs.
>
> I think you could address this problem like this:
>
>  +(boost:(+petroleum +engineer)
>    (+contents:(+petroleum +engineer)
>     +((*:* -boost:petroleum)
>       (*:* -boost:engineer))))
>  -contents:drilling
>
> You don't have to include "-boost:drilling", because this condition is
> entailed by "-contents:drilling".
>
> Steve
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message