lucene-dev mailing list archives

From Halácsy Péter
Subject Re: Relevance boosting with the aid of semantic markup
Date Fri, 07 Dec 2001 07:45:31 GMT
Doug Cutting wrote:

>>From: Stefano Mazzocchi
>>Anyway, a possible solution would be to add the ability to add a
>>'boost-factor' to each token so that the Scorer can perform hits rating
>>based on this information (the search phase could not be influenced by
>>these boost factors).
>A simple approach is to add emphasized terms to a separate field, and always
>search for terms in both the normal field and the emphasized field.  Because
>the emphasized field is shorter, matches in it boost scores more than those
>in the normal field, in the same way that "title" matches are stronger than
>"body" matches.
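A minimal sketch of why the separate, shorter field boosts matches. It is a toy model and not Lucene's actual scoring: it assumes each field's score is the term frequency scaled by 1/sqrt(field length), and the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: this is NOT Lucene's scoring formula.
final class TwoFieldScoreSketch {

    // Assumed per-field score: term frequency scaled by 1/sqrt(field length),
    // so a match in a short field counts for more than one in a long field.
    static double fieldScore(List<String> fieldTokens, String term) {
        if (fieldTokens.isEmpty()) {
            return 0.0;
        }
        long tf = fieldTokens.stream().filter(term::equals).count();
        return tf / Math.sqrt(fieldTokens.size());
    }

    // Always search both fields and add the two scores.
    static double score(List<String> body, List<String> emphasized, String term) {
        return fieldScore(body, term) + fieldScore(emphasized, term);
    }

    public static void main(String[] args) {
        List<String> body = Arrays.asList(
                "lucene", "is", "a", "fast", "search", "library", "for", "java", "text");
        List<String> emphasized = Arrays.asList("lucene"); // emphasized terms only
        System.out.println("lucene: " + score(body, emphasized, "lucene"));
        System.out.println("java:   " + score(body, emphasized, "java"));
    }
}
```

Under these assumptions the emphasized term "lucene" scores higher than "java", although both occur once in the body.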
>I made a proposal a while back which could also be used to achieve this.  It
>is not the most elegant solution, but a solution nonetheless.
Why do you say this is not elegant?

>The proposal was to add a field to Token, as follows:
>  private int positionIncrement = 1;
>  public int getPositionIncrement() { return positionIncrement; }
>  public void setPositionIncrement(int pi) {
>    if (pi < 0)
>      throw new IllegalArgumentException("positionIncrement cannot be negative");
>    positionIncrement = pi;
>  }
>This would be used when indexing to determine a token's position relative to
>the previous token in the stream, for the purposes of phrase searching, as
>in the following diff:
>---	2001/09/18 16:29:52
>+++	2001/12/06 16:24:34
>@@ -159,7 +159,8 @@
> 	  TokenStream stream = analyzer.tokenStream(fieldName, reader);
> 	  try {
> 	    for (Token t = stream.next(); t != null; t = stream.next()) {
>-	      addPosition(fieldName, t.termText(), position++);
>+	      addPosition(fieldName, t.termText(), position);
>+              position += t.getPositionIncrement();
> 	      if (position > maxFieldLength) break;
> 	    }
>A common use would be for an analyzer to set positionIncrement to zero for
>some tokens, so that these tokens logically occupy the same position as the
>previous token.  This would be useful for stemmers that map surface forms to
>multiple stems, a common thing in some languages.  Another use would be to
>have a stop word filter set it to values greater than one, so that phrases
>would not match over stop words, which is desirable to some folks.
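A small sketch of how positions could follow from the increments. It assumes a hypothetical Tok class (not Lucene's Token) and applies the increment before recording the token, which matches the prose: an increment of zero shares the previous token's position. The quoted diff advances the position after recording the token, so there the increment is the distance to the next token instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PositionIncrementSketch {

    // Hypothetical stand-in for a token; this is not Lucene's Token class.
    static final class Tok {
        final String text;
        final int positionIncrement;

        Tok(String text, int positionIncrement) {
            if (positionIncrement < 0) {
                throw new IllegalArgumentException("positionIncrement cannot be negative");
            }
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    // Returns "text@position" for each token. Assumes the first token has an
    // increment of at least 1, so that it lands on position 0 or later.
    static List<String> assignPositions(List<Tok> stream) {
        List<String> placed = new ArrayList<>();
        int position = -1;
        for (Tok t : stream) {
            position += t.positionIncrement; // 0 keeps the previous position
            placed.add(t.text + "@" + position);
        }
        return placed;
    }

    public static void main(String[] args) {
        List<Tok> stream = Arrays.asList(
                new Tok("cats", 1),  // surface form
                new Tok("cat", 0),   // stem at the same position
                new Tok("dogs", 2)); // a stop word was dropped before this token
        System.out.println(assignPositions(stream)); // [cats@0, cat@0, dogs@2]
    }
}
```

With the gap of 2, a phrase query for "cats dogs" would no longer match across the removed stop word.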
I'd love this feature. I would like to put the following at the same position:
1. the original word
2. the stem(s) of the word (if stem(word) != word)
3. if lowercase(word) != word, then lowercase(word)
(I don't like analyzers that give back only lowercased words.)
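The three rules above could be sketched roughly as follows. The names are hypothetical, the stemmer is supplied by the caller, and only a single stem per word is handled; skipping a lowercase form that equals the stem is my own addition to avoid duplicates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

final class SamePositionVariantsSketch {

    // Returns "text+increment" entries: the original word advances the position
    // by 1, every variant uses increment 0 and so shares that position.
    static List<String> expand(String word, UnaryOperator<String> stemmer) {
        List<String> out = new ArrayList<>();
        out.add(word + "+1");                          // 1. the original word

        String stem = stemmer.apply(word);
        if (!stem.equals(word)) {
            out.add(stem + "+0");                      // 2. the stem, if different
        }

        String lower = word.toLowerCase(Locale.ROOT);
        if (!lower.equals(word) && !lower.equals(stem)) {
            out.add(lower + "+0");                     // 3. the lowercase form, if different
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy stemmer for the demo only: strips one trailing "s".
        UnaryOperator<String> toyStemmer =
                w -> w.endsWith("s") ? w.substring(0, w.length() - 1) : w;
        System.out.println(expand("Houses", toyStemmer)); // [Houses+1, House+0, houses+0]
    }
}
```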

>In your case, positionIncrement could be used to repeat an emphasized term
>at the same position to boost its frequency without adversely affecting
>phrase search results.  (However the index would get slightly larger, and
>the searches slightly slower.)
Why can't we store some value for each word? If I could index the stems
of the words as well, I would give them a lower value.
I know a Russian search engine that uses 3 (or 4, I don't remember)
distinct values to classify each term in the index:
1. original word
2. stem
3. spam

The priority of the terms is calculated at indexing time and used for 

A similar solution would make Lucene much more flexible and ready for
high-quality information retrieval systems. Yes, I know the index would
get larger.
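Roughly what I have in mind, as a sketch only. It assumes the class stored with each occurrence at indexing time is later used to weight matches; the weights and all names are made up for illustration and this is not an existing Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TermClassSketch {

    // Classes from the list above; the weights are arbitrary assumptions.
    enum TermClass {
        ORIGINAL(1.0), STEM(0.5), SPAM(0.1);

        final double weight;

        TermClass(double weight) {
            this.weight = weight;
        }
    }

    // term -> class of each indexed occurrence
    private final Map<String, List<TermClass>> postings = new HashMap<>();

    // Indexing time: store the class alongside the term occurrence.
    void index(String term, TermClass termClass) {
        postings.computeIfAbsent(term, k -> new ArrayList<>()).add(termClass);
    }

    // Search time: sum the weights of all occurrences of the term.
    double score(String term) {
        double sum = 0.0;
        for (TermClass c : postings.getOrDefault(term, Collections.emptyList())) {
            sum += c.weight;
        }
        return sum;
    }

    public static void main(String[] args) {
        TermClassSketch idx = new TermClassSketch();
        idx.index("houses", TermClass.ORIGINAL); // the word as written
        idx.index("house", TermClass.STEM);      // its stem, valued lower
        System.out.println(idx.score("houses") + " vs " + idx.score("house"));
    }
}
```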

(In Hungarian, indexing both the stem and the original form of the
word helps a lot.)


