Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Message-ID: <3C487180.9010900@earthlink.net>
Date: Fri, 18 Jan 2002 12:03:28 -0700
From: Dmitry Serebrennikov <dmitrys@earthlink.net>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US;
 rv:0.9.7) Gecko/20011221
MIME-Version: 1.0
To: Lucene Developers List <lucene-dev@jakarta.apache.org>
Subject: Re: Relevance boosting with the aid of semantic markup
References: <4BC270C6AB8AD411AD0B00B0D0493DF0EE7D5C@mail.grandcentral.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit

Doug Cutting wrote:

>I made a proposal a while back which could also be used to achieve this.  It
>is not the most elegant solution, but a solution nonetheless.
>
>The proposal was to add a field to Token, as follows:
>  private int positionIncrement = 1;
>  public int getPositionIncrement() { return positionIncrement; }
>  public void setPositionIncrement(int pi) {
>    if (pi < 0)
>      throw new IllegalArgumentException("positionIncrement cannot be negative");
>    positionIncrement = pi;
>  }
>
>This would be used when indexing to determine a token's position relative to
>the previous token in the stream, for the purposes of phrase searching, as
>in the following diff:
>
>--- DocumentWriter.java	2001/09/18 16:29:52	1.1.1.1
>+++ DocumentWriter.java	2001/12/06 16:24:34
>@@ -159,7 +159,8 @@
> 	  TokenStream stream = analyzer.tokenStream(fieldName, reader);
> 	  try {
> 	    for (Token t = stream.next(); t != null; t = stream.next()) {
>-	      addPosition(fieldName, t.termText(), position++);
>+	      addPosition(fieldName, t.termText(), position);
>+              position += t.getPositionIncrement();
> 	      if (position > maxFieldLength) break;
> 	    }
>
>A common use would be for an analyzer to set positionIncrement to zero for
>some tokens, so that these tokens logically occupy the same position as the
>previous token.  This would be useful for stemmers that map surface forms to
>multiple stems, a common thing in some languages.  Another use would be to
>have a stop word filter set it to values greater than one, so that phrases
>would not match over stop words, which is desirable to some folks.
>
This sounds really interesting! We have a problem where some documents
say "4-pack" and some say "4pack". We want users to be able to find these
documents by typing "4-pack", "4 pack", or "4pack". The only way to deal
with this seemed to be an analyzer that does not simply split "4-pack"
into "4" and "pack" but also adds another term, "4pack"; however, that
would have broken the proximity numbers. It sounds like the proposed
change could allow this analyzer behavior without affecting proximity?
That would be great!
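
To make the idea concrete, here is a rough, standalone sketch of the
analyzer behavior I have in mind. Nothing in it is real Lucene code: Tok,
analyze() and positions() are hypothetical names made up for illustration,
and the sketch only handles words with a single hyphen. It assumes the loop
works exactly as in the quoted diff, where a token is recorded first and
its increment is applied afterwards, so the joined term is emitted first
with an increment of zero.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PositionIncrementSketch {

    /** Hypothetical stand-in for Token, carrying only the proposed field. */
    public static final class Tok {
        final String text;
        final int positionIncrement;

        Tok(String text, int positionIncrement) {
            if (positionIncrement < 0)
                throw new IllegalArgumentException("positionIncrement cannot be negative");
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Hypothetical analyzer: lower-cases, splits on whitespace, and expands a
     * word with exactly one inner hyphen ("4-pack") into "4pack", "4", "pack".
     */
    public static List<Tok> analyze(String input) {
        List<Tok> out = new ArrayList<>();
        for (String word : input.trim().toLowerCase().split("\\s+")) {
            if (word.isEmpty()) continue;
            int dash = word.indexOf('-');
            boolean singleInnerDash = dash > 0
                    && dash < word.length() - 1
                    && dash == word.lastIndexOf('-');
            if (singleInnerDash) {
                String left = word.substring(0, dash);
                String right = word.substring(dash + 1);
                // Increment 0: the token that follows is recorded at this same position.
                out.add(new Tok(left + right, 0));
                out.add(new Tok(left, 1));
                out.add(new Tok(right, 1));
            } else {
                out.add(new Tok(word, 1));
            }
        }
        return out;
    }

    /** Same loop shape as the quoted DocumentWriter diff: record, then advance. */
    public static Map<String, List<Integer>> positions(List<Tok> tokens) {
        Map<String, List<Integer>> postings = new LinkedHashMap<>();
        int position = 0;
        for (Tok t : tokens) {
            postings.computeIfAbsent(t.text, k -> new ArrayList<>()).add(position);
            position += t.positionIncrement;
        }
        return postings;
    }

    public static void main(String[] args) {
        // Prints {4pack=[0], 4=[0], pack=[1], cola=[2]}
        System.out.println(positions(analyze("4-pack cola")));
    }
}
```

Here "4pack" and "4" share position 0, "pack" lands at 1, and "cola" stays
at 2, which is where it would be if "4-pack" had simply been split in two.
This only covers the hyphenated form; a document that literally says
"4pack" would still need separate handling.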

