lucene-java-user mailing list archives

From Steve Rowe <>
Subject Re: interesting phrase query issue
Date Tue, 22 Jul 2003 18:11:26 GMT

I include a patch below which should inhibit exact phrase
matching across a removed stopword.

[Would it be useful for this to be the default behavior?]

See the API docs for Token.setPositionIncrement(int):

"Set [the position increment] to values greater than one to inhibit exact phrase
matches. If, for example, one does not want phrases to match across removed stop
words, then one could build a stop word filter that removes stop words and also
sets the increment to the number of stop words removed before each non-stop
word. Then exact phrase queries will only match when the terms occur with no
intervening stop words."

--------->8-----------cut here--------->8-----------
RCS file:
retrieving revision 1.3
diff -r1.3
 >   private int       positionIncrement = 1;
<     for (Token token = input.next(); token != null; token = input.next())
<       if (table.get(token.termText) == null)
 >     for (Token token = input.next(); token != null; token = input.next()) {
 >       if (table.get(token.termText) == null) {
 >         token.setPositionIncrement(positionIncrement);
 >         positionIncrement = 1; // reset the position increment
 >       } else {
 >         ++positionIncrement;  // stopword -- increase the position increment
 >       }
 >     }
--------->8-----------cut here--------->8-----------
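
For anyone who wants to see the effect without patching Lucene, here is a
rough, self-contained sketch of the same idea. It does not use any Lucene
classes: StopGapDemo, Tok, filter and phraseMatches are names I made up for
illustration, and phraseMatches is only a stand-in for an exact (zero-slop)
phrase match, not how Lucene itself evaluates a PhraseQuery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class StopGapDemo {

    /** A kept term plus its distance, in positions, from the previous kept term. */
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Drops stop words, folding each dropped word into the next kept token's increment. */
    static List<Tok> filter(List<String> terms, Set<String> stopWords) {
        List<Tok> kept = new ArrayList<>();
        int positionIncrement = 1;
        for (String term : terms) {
            if (stopWords.contains(term)) {
                ++positionIncrement;   // stopword: widen the gap before the next kept token
            } else {
                kept.add(new Tok(term, positionIncrement));
                positionIncrement = 1; // reset the position increment
            }
        }
        return kept;
    }

    /** True if the phrase terms occupy consecutive positions (a simplified exact match). */
    static boolean phraseMatches(List<Tok> tokens, String... phrase) {
        // Turn the relative increments into absolute positions.
        Map<Integer, String> termAt = new HashMap<>();
        int position = -1;
        for (Tok t : tokens) {
            position += t.positionIncrement;
            termAt.put(position, t.term);
        }
        // Try every occupied position as the start of the phrase.
        for (int start : termAt.keySet()) {
            boolean all = true;
            for (int i = 0; i < phrase.length && all; i++) {
                all = phrase[i].equals(termAt.get(start + i));
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the"));
        List<Tok> gapped = filter(Arrays.asList("access", "the", "manager"), stop);
        List<Tok> adjacent = filter(Arrays.asList("access", "manager"), stop);
        System.out.println(phraseMatches(gapped, "access", "manager"));   // false
        System.out.println(phraseMatches(adjacent, "access", "manager")); // true
    }
}
```

With the gap recorded, "access, the manager" puts "access" at position 0 and
"manager" at position 2, so the phrase "access manager" no longer matches,
while a document that really contains the two words side by side still does.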

greg wrote:
 > I have several document sections that are being indexed via the
 > StandardAnalyzer.  One of these documents has the line "access, the
 > manager".  When searching for the phrase "access manager", this document
 > is being returned.  I understand why (at least I think I do): "the" is
 > a stop word and the "," is removed by the tokenizer.  My question is:
 > is there any way I can avoid having this returned in the results?  My
 > thoughts were to create a new analyzer that indexes the word "the"
 > (blick too many of those), or to index the "," in some way (also not
 > good).  Any suggestions?
 > Thanks,
 > Greg T Robertson

Steve Rowe
Software Engineer
Center for Natural Language Processing
School of Information Studies
Syracuse University
