lucene-java-user mailing list archives

From Chris Tomlinson <chris.j.tomlin...@gmail.com>
Subject What is the proper use of stop words in Lucene?
Date Wed, 23 Apr 2014 16:36:15 GMT
Hello,

I have posted this question to the list several times without a reply, so I don't know
whether the question is wrong-headed or whether no one reading the list is able to
comment on it.

My question is: what is the correct way to ignore stop word gaps in Lucene 4.4+?

While we are using Lucene 4.4 embedded in eXist-db (exist-db.org), I think the question is
a proper Lucene question and really has nothing to do with the fact that we're using it in
an embedded manner.

The problem to be solved is how to ignore stop word gaps in queries - without the user having
to indicate where such gaps might occur at query time.

Since Lucene 4.4, FilteringTokenFilter.setEnablePositionIncrements(false) is no longer available.
None of the resources I have found, such as "Lucene in Action", explain how to get the
desired effect now that 4.4+ has removed the previous approach.

Prior to Lucene 4.4 it was possible to call setEnablePositionIncrements(false) so that, during
indexing and querying, the number and position of stop word gaps would be ignored (as mentioned
on pp. 138-139 of "Lucene in Action").

This meant that a document with a phrase such as:

   blue is the sky

with stop words "is" and "the" would be selected by the query:

   blue sky

This is what we want to achieve. 
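
To make the difference concrete, here is a minimal plain-Java sketch of the two behaviors. It does
not call any Lucene API; the class name StopWordGapSketch and the methods analyze and phraseMatches
are hypothetical names I made up to model how token positions are assigned when stop word gaps
are kept versus ignored, and how an exact phrase query then compares positions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model of position increments; this is NOT Lucene code.
class StopWordGapSketch {

    // Maps position -> term for every token that is not a stop word.
    // keepGaps = true  : a removed stop word still consumes a position.
    // keepGaps = false : removed stop words leave no gap (the pre-4.4 option).
    public static Map<Integer, String> analyze(String text, Set<String> stopWords, boolean keepGaps) {
        Map<Integer, String> termsByPosition = new LinkedHashMap<>();
        int position = -1;
        int skipped = 0;
        for (String token : text.trim().toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (stopWords.contains(token)) {
                skipped++;
                continue;
            }
            position += 1 + (keepGaps ? skipped : 0);
            skipped = 0;
            termsByPosition.put(position, token);
        }
        return termsByPosition;
    }

    // Exact phrase match: every query term must occur in the document at the
    // same relative position it has in the query.
    public static boolean phraseMatches(Map<Integer, String> doc, Map<Integer, String> query) {
        if (query.isEmpty()) {
            return false;
        }
        int queryStart = query.keySet().iterator().next();
        for (Integer docStart : doc.keySet()) {
            boolean allTermsAlign = true;
            for (Map.Entry<Integer, String> q : query.entrySet()) {
                int wanted = docStart + (q.getKey() - queryStart);
                if (!q.getValue().equals(doc.get(wanted))) {
                    allTermsAlign = false;
                    break;
                }
            }
            if (allTermsAlign) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("is", "the"));
        for (boolean keepGaps : new boolean[] {true, false}) {
            Map<Integer, String> doc = analyze("blue is the sky", stop, keepGaps);
            Map<Integer, String> query = analyze("blue sky", stop, keepGaps);
            System.out.println("keepGaps=" + keepGaps + " doc=" + doc
                    + " matches=" + phraseMatches(doc, query));
        }
    }
}
```

With gaps kept, "sky" sits three positions after "blue" in the document but only one position
after it in the query, so the phrase does not line up; with gaps ignored, both sides collapse to
adjacent positions and the phrase matches.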

Why? We are working with Tibetan and elisions are not uncommon so that, e.g.:

   rin po che

on some occasions might be shortened to

   rin che

and we would like to have a query of

   rin po che

or

   rin che

find all occurrences of

   rin po che

and

   rin che

without requiring the user to mark where elisions might occur.
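
The requirement above can be stated as a small executable check. This is only a sketch in plain
Java, not Lucene code: ElisionMatchSketch, normalize and containsPhrase are hypothetical names,
and treating "po" as the only elidable syllable is an assumption made for the example. The
syllable is dropped from both the text and the query, and the remaining syllables must then
appear contiguously.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical statement of the desired behavior; this is NOT Lucene code.
class ElisionMatchSketch {

    // Splits on whitespace and drops every syllable listed as elidable.
    public static List<String> normalize(String text, Set<String> elidable) {
        List<String> kept = new ArrayList<>();
        for (String syllable : text.trim().split("\\s+")) {
            if (!syllable.isEmpty() && !elidable.contains(syllable)) {
                kept.add(syllable);
            }
        }
        return kept;
    }

    // True when the normalized query occurs as a contiguous run in the normalized text.
    public static boolean containsPhrase(String text, String query, Set<String> elidable) {
        List<String> doc = normalize(text, elidable);
        List<String> wanted = normalize(query, elidable);
        return !wanted.isEmpty() && Collections.indexOfSubList(doc, wanted) >= 0;
    }

    public static void main(String[] args) {
        Set<String> elidable = Collections.singleton("po"); // assumption for this example
        String[] forms = {"rin po che", "rin che"};
        for (String text : forms) {
            for (String query : forms) {
                System.out.println("text=\"" + text + "\" query=\"" + query + "\" -> "
                        + containsPhrase(text, query, elidable));
            }
        }
    }
}
```

All four text/query combinations should report a match, which is the behavior we had before 4.4.
The cost of this approach is that the full and elided forms can no longer be told apart.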

org.apache.lucene.queryparser.flexible.standard.CommonQueryParserConfiguration provides a
setEnablePositionIncrements method, but it does not seem to restore the query behavior
described above that was possible prior to Lucene 4.4.

What is the proper way to ignore stop word gaps?

Thank you,
Chris
