lucene-java-user mailing list archives

From Chris Tomlinson <chris.j.tomlin...@gmail.com>
Subject Re: What is the proper use of stop words in Lucene?
Date Mon, 28 Apr 2014 19:46:26 GMT
Hi,

On Apr 28, 2014, at 11:45 AM, Uwe Schindler <uwe@thetaphi.de> wrote:

>> Hello Uwe,
>> 
>> Thank you for the reply. I see that there is a version check for the use of
>> setEnablePositionIncrements(false); and, I think I may be able to use an
>> earlier api with the eXist-db embedding of Lucene 4.4 to avoid the version
>> check.
> 
> Hi,
> 
> you don't need an older version of the Lucene library. It is enough to pass the constant,
> also with Lucene 4.7 or 4.8 (release in a moment):
> sf = new StopFilter(Version.LUCENE_43, ...);
> sf.setEnablePositionIncrements(false);
> 
> The version constant is exactly to use some components that changed in an incompatible
> way still in later versions, and preserve index/behavior compatibility.

Thank you for the explanation.


> About stop words: What you are doing, is not really "stop words". The main reason for
> stop words is the following:
> - Stop words are in almost every document, so it makes no sense to query for them.

This was my understanding.


> - The only relevant information behind the stop word is "there was a word at this position
> that"

I didn't realize that this was a "necessary" aspect. I can certainly understand that it may
be relevant in some (most) cases and it makes sense to me that it would be appropriate to always
preserve the information in indexing. I was looking for a solution that would essentially
work at query time and had initially thought that the CommonQueryParserConfiguration#setEnablePositionIncrements()
was intended to work this way but it does not.


> If the second item would not be taken care, this information would get lost, too.
> 
> If every document really contains a specific stop word (which is almost always the case),
> there must be no difference between a phrase query with mentioned stop word, using an index
> with all stop words indexed and one with stop words left out. This can only be done, if the
> stop word reserves a position.
> 
> What you intend to do is not a "stopword" use case. You want to "ignore" some words -
> Lucene has no support for this, because in native language processing this makes no sense.
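
To make the distinction between a stop word that "reserves a position" and a word that is simply "ignored" concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: the class StopPositionSketch, the Token type, and the keepGaps flag are hypothetical names invented for illustration, and the whitespace tokenizer and English sample text are assumptions. With keepGaps = true a removed word still consumes a position (the behaviour Uwe describes); with keepGaps = false the positions collapse (roughly what setEnablePositionIncrements(false) was meant to achieve).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a toy model of token positions, not Lucene code.
final class StopPositionSketch {

    // A term together with the position it occupies in the token stream.
    static final class Token {
        final String term;
        final int pos;

        Token(String term, int pos) {
            this.term = term;
            this.pos = pos;
        }
    }

    // Split on whitespace and drop stop words. When keepGaps is true, a dropped
    // word still advances the position counter, leaving a "hole" behind.
    static List<Token> analyze(String text, Set<String> stops, boolean keepGaps) {
        List<Token> out = new ArrayList<>();
        int pos = 0;
        for (String t : text.toLowerCase().trim().split("\\s+")) {
            if (t.isEmpty()) {
                continue;
            }
            if (stops.contains(t)) {
                if (keepGaps) {
                    pos++;
                }
                continue;
            }
            out.add(new Token(t, pos++));
        }
        return out;
    }

    // Exact phrase match: document and query go through the same analysis, and
    // every query term must appear at the same relative offset in the document.
    static boolean phraseMatches(String doc, String phrase, Set<String> stops, boolean keepGaps) {
        List<Token> d = analyze(doc, stops, keepGaps);
        List<Token> q = analyze(phrase, stops, keepGaps);
        if (q.isEmpty()) {
            return false;
        }
        for (Token start : d) {
            if (!start.term.equals(q.get(0).term)) {
                continue;
            }
            int shift = start.pos - q.get(0).pos;
            boolean all = true;
            for (Token qt : q) {
                boolean found = false;
                for (Token dt : d) {
                    if (dt.term.equals(qt.term) && dt.pos == qt.pos + shift) {
                        found = true;
                        break;
                    }
                }
                if (!found) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "of"));
        String doc = "the king of the hill";
        System.out.println(phraseMatches(doc, "king of the hill", stops, true));
        System.out.println(phraseMatches(doc, "king hill", stops, true));
        System.out.println(phraseMatches(doc, "king hill", stops, false));
    }
}
```

With positions reserved, the phrase "king of the hill" matches but "king hill" does not, because the index remembers that two words stood between them. With positions collapsed, "king hill" matches as well, which is the "ignore" behaviour I was after.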

Thank you for the information. I was unaware that ignoring some words "makes no sense". I
thought I gave a reasonable example of exactly this situation in the native processing of
Tibetan. Perhaps I am still not understanding.


> One way to do this is to:
> a) write your own TokenFilter, violating the TokenStream contracts
> b) use the Backwards compatibility layer with matchVersion=LUCENE_43
> c) maybe remove the words before tokenizing (e.g. MappingCharFilter, mapping the "ignore
> words" to empty string)
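
As I understand option (c), the idea is that the words to be ignored never reach the tokenizer at all, so no position is ever assigned to them. The following is a rough plain-Java stand-in for that idea, not the MappingCharFilter API itself: the class IgnoreWordsSketch, its method names, the whitespace tokenizer, and the English sample mappings are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: strips "ignore words" from the raw text before tokenizing.
final class IgnoreWordsSketch {

    // Map each ignore string to the empty string, longest strings first so that
    // a short entry cannot break up a longer one before it is tried.
    static String stripIgnored(String text, List<String> ignore) {
        List<String> sorted = new ArrayList<>(ignore);
        sorted.sort((a, b) -> b.length() - a.length());
        String out = text;
        for (String w : sorted) {
            out = out.replace(w, "");
        }
        return out;
    }

    // Trivial whitespace tokenizer standing in for the real analysis chain.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The trailing space in each mapping is a crude word delimiter.
        List<String> ignore = Arrays.asList("of ", "the ");
        System.out.println(tokenize(stripIgnored("king of the hill", ignore)));
    }
}
```

One caveat with this kind of character-level mapping is that it matches substrings, not words: the mapping "the " would also eat the end of "bathe ", so the mapped strings need to include whatever delimiter the language uses.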

Thank you for these useful approaches to solving the use case.

ciao,
Chris



> 
> Uwe
> 



