lucene-dev mailing list archives

From "Robin H. Johnson" <robb...@orbis-terrarum.net>
Subject Re: Statistical evaluation of modifications to a Lucene query based on search logs
Date Thu, 04 May 2006 21:11:47 GMT
On Thu, May 04, 2006 at 10:52:46AM -0400, Daniel Shane wrote:
> I'm developing a new type of Query, called a SubPhraseQuery. I have sent 
> a message to the list regarding this and Doug was kind enough to put me 
> on the right track. The query is simply a PhraseQuery where all terms 
> are searched, but, if any of the subphrases are found, it boosts the 
> results the longer the subphrase is.
I can't help on the analyzing portion, but I can show you an alternative
implementation.

We use Lucene to power the search behind isohunt.com, and I came up with
a different way of doing what you want. It's got less in the way of
magic constants, and more in the way of using existing Lucene
functionality.

It's got one difference from yours, in that the terms are allowed to
occur in any order in the sub-phrases (so phrase "C B" from your
original example is scored like "B C").

If the query is a boolean query, it's a candidate for transmuting.
Otherwise it's just used as is.

/* Pseudo-code follows */
static Query transmuteBooleanQueryToSpanQuery(BooleanQuery bq)
1. Set required = get all terms with BooleanClause.Occur.MUST.
2. Set optional = get all terms with BooleanClause.Occur.SHOULD.
3. If the sum of the sizes of the two sets is <= 1, just return bq unchanged (safety case).
4. SpanTermQuery stq[] = (construct a SpanTermQuery for each item in the above sets).
5. This is the bit of magic here: Define a value 'proximity' using the
   size of the sets above. We use required.size*3 + optional.size*2 + 5.
6. snq = new SpanNearQuery(stq,proximity,false);
7. bq.add(snq, BooleanClause.Occur.SHOULD);
8. return bq;
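
To make the arithmetic concrete, here is a minimal sketch of just the safety check and the proximity calculation in plain Java, with no Lucene on the classpath. The class and method names (ProximitySketch, worthTransmuting, proximity) are hypothetical and are not part of Lucene or of the code described above; only the formula required.size*3 + optional.size*2 + 5 comes from the steps above.

```java
public class ProximitySketch {

    // Safety case: with one term or fewer there is nothing to place
    // near anything else, so the query would be returned unchanged.
    // (Hypothetical helper name, for illustration only.)
    static boolean worthTransmuting(int requiredCount, int optionalCount) {
        return requiredCount + optionalCount > 1;
    }

    // The 'proximity' value that would be passed as the slop argument
    // of SpanNearQuery, using the constants quoted in the message.
    static int proximity(int requiredCount, int optionalCount) {
        return requiredCount * 3 + optionalCount * 2 + 5;
    }

    public static void main(String[] args) {
        // Two MUST terms and one SHOULD term: 2*3 + 1*2 + 5 = 13
        System.out.println(proximity(2, 1));
        // A single-term query is left as is.
        System.out.println(worthTransmuting(1, 0));
    }
}
```

So a query with two required terms and one optional term would get a slop of 13, and since the third SpanNearQuery argument is false, the terms may match in any order within that window.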

-- 
Robin Hugh Johnson
E-Mail     : robbat2@orbis-terrarum.net
Home Page  : http://www.orbis-terrarum.net/?l=people.robbat2
ICQ#       : 30269588 or 41961639
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85
