lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "greg" <g...@webpipe.net>
Subject Re: interesting phrase query issue
Date Thu, 24 Jul 2003 12:46:36 GMT
Steve, 

This is great stuff.  IMHO this should be the default behaviour - especially 
coming from someone who is new to Lucene.  I can not tell you how surprised 
i was when someone searched for a phrase and it was returned  but did not 
exist in the document except as "access, the manager".  Your patch really 
did the job. 

Thanks again, 

Greg T Robertson 

Steve Rowe writes: 

> Greg, 
> 
> I include a patch below to StopFilter.java which should inhibit exact 
> phrase
> matching across a removed stopword. 
> 
> [Would it be useful for this to be the default behavior?] 
> 
> See the API docs for Token.setPositionIncrement(int): 
> 
> <URL:http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/ 
> Token.html#setPositionIncrement(int)>
> "Set [the position increment] to values greater than one to inhibit exact 
> phrase
> matches. If, for example, one does not want phrases to match across 
> removed stop
> words, then one could build a stop word filter that removes stop words and 
> also
> sets the increment to the number of stop words removed before each 
> non-stop
> word. Then exact phrase queries will only match when the terms occur with 
> no
> intervening stop words." 
> 
> --------->8-----------cut here--------->8-----------
> Index: StopFilter.java
> ===================================================================
> RCS file:
> /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/analysis/StopFil 
> ter.java,v
> retrieving revision 1.3
> diff -r1.3 StopFilter.java
> 64a65
> >   private int       positionIncrement = 1;
> 93,94c94,97
> <     for (Token token = input.next(); token != null; token = 
> input.next())
> <       if (table.get(token.termText) == null)
> ---
> >     for (Token token = input.next(); token != null; token = 
> input.next()) {
> >       if (table.get(token.termText) == null) {
> >         token.setPositionIncrement(positionIncrement);
> >         positionIncrement = 1; // reset the position increment
> 95a99,102
> >       } else {
> >         ++positionIncrement;  // stopword -- increase the position 
> increment
> >       }
> >     }
> --------->8-----------cut here--------->8----------- 
> 
> greg wrote:
> > I have several document sections that are being indexed via the
> > StandardAnalyzer.  One of these documents has the line "access, the
> > manager".  When searching for the phrase "access manager", this document
> > is being returned.  I understand why (at least i think i do), because a
> > stop word is "the" and the "," is being removed by the tokenizer, my
> > question is is there any way I can avoid having this returned in the
> > results?  My thoughts were to create a new analyzer that indexes the
> > word "the" (blick to many of those), or index the "," in some way (also
> > not good).  Any suggestions?
> > Thanks,
> > Greg T Robertson 
> 
> -- 
> Steve Rowe
> Software Engineer
> Center for Natural Language Processing
> School of Information Studies
> Syracuse University
> www.cnlp.org 
> 
>  
> 
>  
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org 
> 
 

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message