lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Highlighter, Term Positions and Stopwords
Date Tue, 06 Dec 2005 10:47:17 GMT

On Dec 5, 2005, at 11:32 PM, Dan Climan wrote:
> Do stopfilters create non-contiguous token positions?

No, not currently.  StopFilter leaves token positions in their
original state, which defaults to contiguous (a position increment
of 1).
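
To make that concrete, here is a minimal sketch in plain Java.  It
does not use Lucene's classes; ContiguousStopSketch and filter are
names invented for illustration, and the whitespace tokenizing is an
assumption.  It only shows how positions stay contiguous when every
surviving token keeps the default increment of 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a stop filter; this is not Lucene's StopFilter.
public class ContiguousStopSketch {

    // Drops stop words and gives every surviving token an increment of 1,
    // so the resulting positions have no gaps.
    public static List<String> filter(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        int position = -1; // the first surviving token lands on position 0
        for (String word : text.toLowerCase().split("\\s+")) {
            if (stopWords.contains(word)) {
                continue; // removed; nothing records that it was ever here
            }
            position += 1; // default increment, unchanged by the removal
            out.add(word + "@" + position);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "of"));
        System.out.println(filter("the lord of the rings", stop));
    }
}
```

With "the" and "of" as stop words, "the lord of the rings" comes out
as lord@0 and rings@1, as if the two words had been adjacent.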

There is an open issue to change this behavior, though.  At one
point I changed it temporarily, but that caused problems with
PhraseQuery and QueryParser.  PhraseQuery now supports explicit term
positions, and QueryParser also supports setting the PhraseQuery term
positions appropriately.  So perhaps it is time to change StopFilter,
or perhaps make it an optional feature.

I like the idea of leaving holes in the token positions.  That gives
a more accurate picture of the original text, so phrase queries can
avoid matching across the places where stop words were removed unless
some slop is specified.
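
A rough sketch of the difference, again in plain Java rather than
Lucene's API.  PositionGapSketch, positionsOf, and phraseMatches are
hypothetical names, and the two-term matcher uses a simplified notion
of slop (extra positions allowed between the terms) that is an
assumption for illustration, not PhraseQuery's actual scoring logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names are Lucene APIs.
public class PositionGapSketch {

    // Returns the positions at which `term` survives stop word removal.
    // With keepGaps, a removed stop word still consumes a position,
    // leaving a hole where it used to be.
    public static List<Integer> positionsOf(String term, String text,
                                            Set<String> stopWords, boolean keepGaps) {
        List<Integer> positions = new ArrayList<>();
        int position = -1;
        for (String word : text.toLowerCase().split("\\s+")) {
            boolean stop = stopWords.contains(word);
            if (!stop || keepGaps) {
                position++;
            }
            if (!stop && word.equals(term)) {
                positions.add(position);
            }
        }
        return positions;
    }

    // Simplified two-term phrase check: "first second" matches when second
    // follows first with at most `slop` extra positions between them.
    public static boolean phraseMatches(String first, String second, int slop,
                                        String text, Set<String> stopWords, boolean keepGaps) {
        for (int p1 : positionsOf(first, text, stopWords, keepGaps)) {
            for (int p2 : positionsOf(second, text, stopWords, keepGaps)) {
                int extra = p2 - p1 - 1;
                if (extra >= 0 && extra <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "of"));
        String text = "the lord of the rings";
        System.out.println(phraseMatches("lord", "rings", 0, text, stop, false)); // contiguous positions
        System.out.println(phraseMatches("lord", "rings", 0, text, stop, true));  // holes, no slop
        System.out.println(phraseMatches("lord", "rings", 2, text, stop, true));  // holes, slop of 2
    }
}
```

With contiguous positions the phrase "lord rings" matches "the lord of
the rings" exactly.  With holes, lord sits at position 1 and rings at
position 4, so the exact phrase no longer matches, and in this sketch
a slop of 2 is needed to bridge the two removed words.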

> The javadocs for this method note that:
>
> tokenPositionsGuaranteedContiguous - true if the token position
> numbers have no overlaps or gaps.

You will want this to be set to true.

> I was curious if stopwords, by definition, meant that tokens were not
> contiguous? Is this still true if the query uses the same analyzer and
> filters out the same stopwords?

Currently all built-in analyzers produce contiguous token positions,
regardless of any tokens that may have been removed.

	Erik



