lucene-dev mailing list archives

From Doug Cutting <DCutt...@grandcentral.com>
Subject RE: Relevance boosting with the aid of semantic markup
Date Thu, 06 Dec 2001 16:53:55 GMT
> From: Stefano Mazzocchi [mailto:stefano@apache.org]
> 
> Anyway, a possible solution would be to add the ability to add a
> 'boost-factor' to each token so that the Scorer can perform hits rating
> based on this information (the search phase could not be influenced by
> these boost factors).

A simple approach is to add emphasized terms to a separate field, and always
search for terms in both the normal field and the emphasized field.  Because
the emphasized field is shorter, matches in it boost scores more than those
in the normal field, in the same way that "title" matches are stronger than
"body" matches.
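
As a rough illustration of the field-length effect described above, here is a
toy sketch. It assumes a made-up score of termFrequency / sqrt(fieldLength)
summed over both fields; that formula, the class name FieldLengthBoostSketch,
and the sample text are assumptions for illustration only, not Lucene's actual
scoring code.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: score = termFrequency / sqrt(fieldLength).
// This is an assumed simplification, not Lucene's real scoring formula.
class FieldLengthBoostSketch {

    /** Score of a term in one field: occurrences scaled down by field length. */
    static double fieldScore(List<String> fieldTokens, String term) {
        if (fieldTokens.isEmpty()) return 0.0;
        long tf = fieldTokens.stream().filter(term::equals).count();
        return tf / Math.sqrt(fieldTokens.size());
    }

    /** Search the normal field and the emphasized field, and sum both contributions. */
    static double score(List<String> body, List<String> emphasized, String term) {
        return fieldScore(body, term) + fieldScore(emphasized, term);
    }

    public static void main(String[] args) {
        // Nine body tokens; only "lucene" was emphasized in the markup.
        List<String> body = Arrays.asList(
            "lucene", "is", "a", "fast", "search", "library", "written", "in", "java");
        List<String> emphasized = Arrays.asList("lucene");

        System.out.println("lucene: " + score(body, emphasized, "lucene"));
        System.out.println("java:   " + score(body, emphasized, "java"));
    }
}
```

Both terms occur once in the body, but "lucene" also matches in the
one-token emphasized field, so under this toy formula it ends up with the
higher total.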

I made a proposal a while back which could also be used to achieve this.  It
is not the most elegant solution, but a solution nonetheless.

The proposal was to add a field to Token, as follows:
  private int positionIncrement = 1;
  public int getPositionIncrement() { return positionIncrement; }
  public void setPositionIncrement(int pi) {
    if (pi < 0)
      throw new IllegalArgumentException("positionIncrement cannot be negative");
    positionIncrement = pi;
  }

This would be used when indexing to determine a token's position relative to
the previous token in the stream, for the purposes of phrase searching, as
in the following diff:

--- DocumentWriter.java	2001/09/18 16:29:52	1.1.1.1
+++ DocumentWriter.java	2001/12/06 16:24:34
@@ -159,7 +159,8 @@
 	  TokenStream stream = analyzer.tokenStream(fieldName, reader);
 	  try {
 	    for (Token t = stream.next(); t != null; t = stream.next()) {
-	      addPosition(fieldName, t.termText(), position++);
+	      addPosition(fieldName, t.termText(), position);
+	      position += t.getPositionIncrement();
 	      if (position > maxFieldLength) break;
 	    }

A common use would be for an analyzer to set positionIncrement to zero for
some tokens, so that these tokens logically occupy the same position as the
previous token.  This would be useful for stemmers that map surface forms to
multiple stems, a common thing in some languages.  Another use would be to
have a stop word filter set it to values greater than one, so that phrases
would not match over stop words, which is desirable to some folks.
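
The two uses above can be sketched with a small stand-alone example. The Tok
class is a hypothetical stand-in for Token, and the stems and stop word are
invented sample data. The sketch assumes each token's increment is applied
before the token is recorded, so that a zero increment places a token at the
previous token's position; that ordering is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PositionIncrementSketch {

    /** Hypothetical stand-in for Token: term text plus a position increment. */
    static final class Tok {
        final String text;
        final int positionIncrement;

        Tok(String text, int positionIncrement) {
            if (positionIncrement < 0)
                throw new IllegalArgumentException("positionIncrement cannot be negative");
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Each token sits positionIncrement slots after the previous token. */
    static List<Integer> positions(List<Tok> stream) {
        List<Integer> out = new ArrayList<>();
        // Start at -1 so a first token with increment 1 lands at position 0.
        // (A first token with increment 0 is not handled in this sketch.)
        int position = -1;
        for (Tok t : stream) {
            position += t.positionIncrement;
            out.add(position);
        }
        return out;
    }

    public static void main(String[] args) {
        // A stemmer emitting a second stem, "goose", stacked on "geese".
        List<Tok> stems = Arrays.asList(
            new Tok("wild", 1), new Tok("geese", 1), new Tok("goose", 0), new Tok("fly", 1));
        System.out.println(positions(stems));   // [0, 1, 1, 2]

        // "man of war" with the stop word "of" dropped but its gap preserved.
        List<Tok> stopped = Arrays.asList(new Tok("man", 1), new Tok("war", 2));
        System.out.println(positions(stopped)); // [0, 2]
    }
}
```

In the second case "man" and "war" are two positions apart, so an exact
phrase query for "man war" would not match across the removed stop word.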

In your case, positionIncrement could be used to repeat an emphasized term
at the same position to boost its frequency without adversely affecting
phrase search results.  (However, the index would get slightly larger, and
the searches slightly slower.)
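
A minimal sketch of that idea, under stated assumptions: the postings map, the
index() helper and the repeat count of 3 are all hypothetical, and the extra
copies simply stand for tokens emitted with a position increment of zero.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EmphasisRepeatSketch {

    /**
     * Builds term -> list of positions. An emphasized term is recorded
     * `repeats` times at the same position; other terms are recorded once.
     */
    static Map<String, List<Integer>> index(String[] terms, boolean[] emphasized, int repeats) {
        Map<String, List<Integer>> postings = new LinkedHashMap<>();
        for (int position = 0; position < terms.length; position++) {
            int copies = emphasized[position] ? repeats : 1;
            for (int i = 0; i < copies; i++) {
                // Extra copies behave like tokens with a position increment of zero.
                postings.computeIfAbsent(terms[position], k -> new ArrayList<>()).add(position);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        String[] terms = {"a", "very", "important", "point"};
        boolean[] emphasized = {false, false, true, false};

        Map<String, List<Integer>> postings = index(terms, emphasized, 3);
        System.out.println(postings.get("important")); // [2, 2, 2]
        System.out.println(postings.get("point"));     // [3]
    }
}
```

The emphasized term now has a frequency of three, while "point" still sits
at the position right after it, so the phrase "important point" lines up as
before. The two extra postings are the index growth mentioned above.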

Doug
