lucene-dev mailing list archives

From Dmitry Serebrennikov <dmit...@earthlink.net>
Subject Re: Relevance boosting with the aid of semantic markup
Date Fri, 18 Jan 2002 19:03:28 GMT
Doug Cutting wrote:

>I made a proposal a while back which could also be used to achieve this.  It
>is not the most elegant solution, but a solution nonetheless.
>
>The proposal was to add a field to Token, as follows:
>  private int positionIncrement = 1;
>  public int getPositionIncrement() { return positionIncrement; }
>  public void setPositionIncrement(int pi) {
>    if (pi < 0)
>      throw new IllegalArgumentException("positionIncrement cannot be negative");
>    positionIncrement = pi;
>  }
>
>This would be used when indexing to determine a token's position relative to
>the previous token in the stream, for the purposes of phrase searching, as
>in the following diff:
>
>--- DocumentWriter.java	2001/09/18 16:29:52	1.1.1.1
>+++ DocumentWriter.java	2001/12/06 16:24:34
>@@ -159,7 +159,8 @@
> 	  TokenStream stream = analyzer.tokenStream(fieldName, reader);
> 	  try {
> 	    for (Token t = stream.next(); t != null; t = stream.next()) {
>-	      addPosition(fieldName, t.termText(), position++);
>+	      addPosition(fieldName, t.termText(), position);
>+	      position += t.getPositionIncrement();
> 	      if (position > maxFieldLength) break;
> 	    }
>
>A common use would be for an analyzer to set positionIncrement to zero for
>some tokens, so that these tokens logically occupy the same position as the
>previous token.  This would be useful for stemmers that map surface forms to
>multiple stems, a common thing in some languages.  Another use would be to
>have a stop word filter set it to values greater than one, so that phrases
>would not match over stop words, which is desirable to some folks.
>
This sounds really interesting! We have a problem where some documents
might say "4-pack" and some "4pack". We want users to be able to find
these documents by typing "4-pack", "4 pack", or "4pack". The only way
to deal with this seemed to be an analyzer that does not simply split
"4-pack" into "4" and "pack" but also adds a third term, "4pack". That
would have broken the proximity numbers, though, since the extra term
would take up a position of its own. It sounds like the proposed change
could allow this analyzer behavior without affecting proximity? That
would be great!
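
A rough sketch of the idea, not actual Lucene code: the Token class, analyze() and positions() below are hypothetical stand-ins written only for illustration. It assumes the increment is read as the offset from the previous token, as in the description quoted above, so it is applied before the token's position is recorded. Only the hyphenated case is handled; a document that already says "4pack" would need extra splitting logic.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionIncrementSketch {

    /** Hypothetical stand-in for a token carrying the proposed positionIncrement. */
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            if (positionIncrement < 0)
                throw new IllegalArgumentException("positionIncrement cannot be negative");
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Hypothetical analyzer: lowercases, splits on whitespace and hyphens, and
     * for a hyphenated word also emits the joined form with increment 0 so that
     * it shares a position with the first part.
     */
    static List<Token> analyze(String input) {
        List<Token> tokens = new ArrayList<>();
        for (String word : input.trim().toLowerCase().split("\\s+")) {
            List<String> parts = new ArrayList<>();
            for (String p : word.split("-")) {
                if (!p.isEmpty()) parts.add(p);
            }
            if (parts.isEmpty()) continue;
            tokens.add(new Token(parts.get(0), 1));               // "4"
            if (parts.size() > 1) {
                tokens.add(new Token(String.join("", parts), 0)); // "4pack", same position as "4"
            }
            for (int i = 1; i < parts.size(); i++) {
                tokens.add(new Token(parts.get(i), 1));           // "pack"
            }
        }
        return tokens;
    }

    /** Assigns positions by adding each token's increment to a running counter. */
    static List<String> positions(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int position = -1; // the first token's increment of 1 puts it at position 0
        for (Token t : tokens) {
            position += t.positionIncrement;
            out.add(t.text + "@" + position);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [buy@0, 4@1, 4pack@1, pack@2, today@3]
        System.out.println(positions(analyze("Buy 4-pack today")));
    }
}
```

Here "4pack" sits at the same position as "4", so "pack" and "today" keep the positions they would have had without the extra term, and a phrase such as "pack today" would still see the two words as adjacent.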




