From: Dmitry Serebrennikov
To: Lucene Developers List
Subject: Re: Relevance boosting with the aid of semantic markup
Date: Fri, 18 Jan 2002 12:03:28 -0700
Message-ID: <3C487180.9010900@earthlink.net>
References: <4BC270C6AB8AD411AD0B00B0D0493DF0EE7D5C@mail.grandcentral.com>

Doug Cutting wrote:

>I made a proposal a while back which could also be used to achieve this.  It
>is not the most elegant solution, but a solution nonetheless.
>
>The proposal was to add a field to Token, as follows:
>  private int positionIncrement = 1;
>  public int getPositionIncrement() { return positionIncrement; }
>  public void setPositionIncrement(int pi) {
>    if (pi < 0)
>      throw new IllegalArgumentException("positionIncrement cannot be negative");
>    positionIncrement = pi;
>  }
>
>This would be used when indexing to determine a token's position relative to
>the previous token in the stream, for the purposes of phrase searching, as
>in the following diff:
>
>--- DocumentWriter.java 2001/09/18 16:29:52 1.1.1.1
>+++ DocumentWriter.java 2001/12/06 16:24:34
>@@ -159,7 +159,8 @@
>       TokenStream stream = analyzer.tokenStream(fieldName, reader);
>       try {
>         for (Token t = stream.next(); t != null; t = stream.next()) {
>-          addPosition(fieldName, t.termText(), position++);
>+          addPosition(fieldName, t.termText(), position);
>+          position += t.getPositionIncrement();
>           if (position > maxFieldLength) break;
>         }
>
>A common use would be for an analyzer to set positionIncrement to zero for
>some tokens, so that these tokens logically occupy the same position as the
>previous token.  This would be useful for stemmers that map surface forms to
>multiple stems, a common thing in some languages.  Another use would be to
>have a stop word filter set it to values greater than one, so that phrases
>would not match over stop words, which is desirable to some folks.
>

This sounds really interesting!

We have a problem where some documents might say "4-pack" and some "4pack".
We want to make it so that users can find these documents by typing "4-pack"
or "4 pack" or "4pack". The only way to deal with this seemed to be an
analyzer that does not simply split "4-pack" into "4" and "pack" but also
adds another term, "4pack". That, however, would have broken the proximity
numbers.

It sounds like the proposed change could allow this analyzer behavior
without affecting proximity? That would be great!
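
To make the "4-pack" case concrete, here is a rough sketch of what such an analyzer might emit. It does not use the Lucene classes: Tok, analyze() and index() are hypothetical stand-ins invented for illustration. It follows the order used in the quoted diff (record the term at the current position, then advance by that token's increment), so giving "4" an increment of zero lets the joined term "4pack" land on the same position.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PositionIncrementSketch {

    /** Hypothetical stand-in for a token carrying the proposed positionIncrement. */
    static final class Tok {
        final String text;
        final int positionIncrement;

        Tok(String text, int positionIncrement) {
            if (positionIncrement < 0)
                throw new IllegalArgumentException("positionIncrement cannot be negative");
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Splits on whitespace. A hyphenated word yields its first part with
     * increment 0, then the joined form, then the remaining parts, so the
     * joined form shares a position with the first part.
     */
    static List<Tok> analyze(String text) {
        List<Tok> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            List<String> parts = new ArrayList<>();
            for (String p : word.split("-")) {
                if (!p.isEmpty()) parts.add(p);
            }
            if (parts.isEmpty()) continue;
            if (parts.size() == 1) {
                out.add(new Tok(parts.get(0), 1));
                continue;
            }
            out.add(new Tok(parts.get(0), 0));            // next token stays at this position
            out.add(new Tok(String.join("", parts), 1));  // e.g. "4pack"
            for (int i = 1; i < parts.size(); i++) {
                out.add(new Tok(parts.get(i), 1));
            }
        }
        return out;
    }

    /** Mirrors the loop in the quoted diff: add at the current position, then advance. */
    static Map<String, List<Integer>> index(List<Tok> tokens) {
        Map<String, List<Integer>> postings = new LinkedHashMap<>();
        int position = 0;
        for (Tok t : tokens) {
            postings.computeIfAbsent(t.text, k -> new ArrayList<>()).add(position);
            position += t.positionIncrement;
        }
        return postings;
    }

    public static void main(String[] args) {
        System.out.println(index(analyze("buy 4-pack now")));
    }
}
```

For "buy 4-pack now" this prints {buy=[0], 4=[1], 4pack=[1], pack=[2], now=[3]}. The terms "4", "pack" and "now" sit at the same positions they would have without the extra term, so "4 pack" still works as a phrase and "4pack" is findable as a single term. A document that contains only "4pack" would still need separate handling to match "4 pack"; the sketch does not attempt that.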