lucene-java-user mailing list archives

From Chris Tomlinson <chris.j.tomlin...@gmail.com>
Subject Re: What is the proper use of stop words in Lucene?
Date Mon, 28 Apr 2014 16:16:03 GMT
Hello Uwe,

Thank you for the reply. I see that there is a version check on the use of setEnablePositionIncrements(false),
and I think I may be able to use an earlier API version with the eXist-db embedding of Lucene 4.4
to avoid the version check.

However, my question was intended to improve my understanding of how to properly use stop
words and/or how to properly achieve the use case that I outlined.

My naive understanding of the purpose of stop words is:

        to remove from indexing words that are not helpful in discriminating or selecting
        documents since they occur so frequently.

The use case that I intended to illustrate is meant to ignore the occurrence or non-occurrence
of stop words in a query w.r.t. selection of documents during search.

As I understand the situation currently, occurrences of stop words in a query phrase are replaced
by "?"s to indicate the presence of an otherwise unspecified word in the query. So the phrase:

        blue is the moon

with "is" and "the" as stop words, would be indexed effectively as:

        blue ? ? moon

and the query phrase:

        blue was a moon

would be treated as:

        blue ? ? moon

and would retrieve a document containing:

        blue is the moon

But in the use case that I presented we really want the query:

        blue moon

to also select the document without the user having to indicate whether stop words might
be present.
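
The position behavior described above can be sketched in plain Java. This is only a model of the understanding stated in this message, not Lucene code: the class StopGapSketch, its stop set (which assumes "was" and "a" are stop words alongside "is" and "the"), and all helper names are hypothetical. The keepGaps flag stands in for position increments being enabled (true) or disabled (false).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of stop-word position handling. This does NOT call Lucene.
final class StopGapSketch {

    // Assumed stop set for illustration: "is" and "the" from the document, "was" and "a" from the query.
    static final Set<String> STOP = new HashSet<>(Arrays.asList("is", "the", "was", "a"));

    // A surviving term together with the position it was assigned.
    static final class Tok {
        final String term;
        final int pos;
        Tok(String term, int pos) { this.term = term; this.pos = pos; }
        @Override public String toString() { return term + "@" + pos; }
    }

    // keepGaps=true: a removed stop word still consumes a position (the "?" gaps).
    // keepGaps=false: removed stop words leave no gap, as with the older setting.
    static List<Tok> analyze(String text, boolean keepGaps) {
        List<Tok> out = new ArrayList<>();
        int position = 0;
        for (String token : text.trim().toLowerCase().split("\\s+")) {
            if (STOP.contains(token)) {
                if (keepGaps) position++;
                continue;
            }
            out.add(new Tok(token, position));
            position++;
        }
        return out;
    }

    static boolean contains(List<Tok> doc, String term, int pos) {
        for (Tok t : doc) {
            if (t.pos == pos && t.term.equals(term)) return true;
        }
        return false;
    }

    // Exact phrase match: every query term must occur in the document at the
    // same offset relative to the first query term.
    static boolean phraseMatches(String query, String doc, boolean keepGaps) {
        List<Tok> q = analyze(query, keepGaps);
        List<Tok> d = analyze(doc, keepGaps);
        if (q.isEmpty()) return false;
        for (Tok start : d) {
            if (!start.term.equals(q.get(0).term)) continue;
            int shift = start.pos - q.get(0).pos;
            boolean all = true;
            for (Tok qt : q) {
                if (!contains(d, qt.term, qt.pos + shift)) { all = false; break; }
            }
            if (all) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "blue is the moon";
        System.out.println(analyze(doc, true));                           // [blue@0, moon@3]
        System.out.println(phraseMatches("blue was a moon", doc, true));  // true
        System.out.println(phraseMatches("blue moon", doc, true));        // false
        System.out.println(phraseMatches("blue moon", doc, false));       // true
    }
}
```

In this model, "blue moon" fails against "blue is the moon" when gaps are kept because "moon" sits at position 3 in the document but at position 1 in the query; with gaps dropped, both sides collapse to positions 0 and 1 and the phrase matches.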

So my question is:

        How is one supposed to achieve this use case in Lucene 4.4+?

Thank you,
Chris




On Apr 24, 2014, at 5:52 AM, Uwe Schindler <uwe@thetaphi.de> wrote:

> Hi,
> 
> You can still change the setting on the TokenFilter after creating it: StopFilter#setEnablePositionIncrements(false)
> - this method was *not* removed!
> This fails only if you pass matchVersion>=Version.LUCENE_44. Just use an older matchVersion
> parameter to the constructor and you can still enable this broken behavior (for backwards
> compatibility).
> 
> This is no longer officially supported, but can be a workaround. To me it looks like
> you misunderstood stopwords.
> 
> Uwe
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
> 
>> -----Original Message-----
>> From: Tincu Gabriel [mailto:tincu.gabriel@gmail.com]
>> Sent: Thursday, April 24, 2014 12:27 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: What is the proper use of stop words in Lucene?
>> 
>> Hi there,
>> The StopFilterFactory can be used to produce StopFilters containing the
>> desired stop words. As a constructor argument it takes a
>> Map<String,String>, and one of the valid keys you can pass in it is
>> "enablePositionIncrements". If you don't pass that in, it defaults to true.
>> Is this what you were looking for?
>> 
>> 
>> On Wed, Apr 23, 2014 at 12:36 PM, Chris Tomlinson <
>> chris.j.tomlinson@gmail.com> wrote:
>> 
>>> Hello,
>>> 
>>> I've written several times now on the list with this question /
>>> problem and no one has yet replied so I don't know if the question is
>>> too wrong-headed or if there is simply no one reading the list that
>>> can comment on the question.
>>> 
>>> The question that I'm trying to get answered is what is the correct
>>> way of ignoring stop word gaps in Lucene 4.4+?
>>> 
>>> While we are using Lucene 4.4 embedded in eXist-db (exist-db.org), I
>>> think the question is a proper Lucene question and really has nothing
>>> to do with the fact that we're using it in an embedded manner.
>>> 
>>> The problem to be solved is how to ignore stop word gaps in queries -
>>> without the user having to indicate where such gaps might occur at
>>> query time.
>>> 
>>> Since Lucene 4.4 the
>>> FilteringTokenFilter.setEnablePositionIncrements(false) is not available.
>>> None of the resources, such as "Lucene in Action", explain
>>> how to use Lucene to get the desired effect now that 4.4+ has removed
>>> the previous approach.
>>> 
>>> Prior to Lucene 4.4 it was possible to
>>> setEnablePositionIncrements(false)
>>> so that during indexing and querying the number and position of stop
>>> word gaps would be ignored (as mentioned on pp 138-139 of "Lucene in
>>> Action").
>>> 
>>> This meant that a document with a phrase such as:
>>> 
>>>   blue is the sky
>>> 
>>> with stop words "is" and "the" would be selected by the query:
>>> 
>>>   blue sky
>>> 
>>> This is what we want to achieve.
>>> 
>>> Why? We are working with Tibetan, and elisions are not uncommon, so
>>> that, e.g.:
>>> 
>>>   rin po che
>>> 
>>> on some occasions might be shortened to
>>> 
>>>   rin che
>>> 
>>> and we would like to have a query of
>>> 
>>>   rin po che
>>> 
>>> or
>>> 
>>>   rin che
>>> 
>>> find all occurrences of
>>> 
>>>   rin po che
>>> 
>>> and
>>> 
>>>   rin che
>>> 
>>> without requiring the user to mark where elisions might occur.
>>> 
>>> The org.apache.lucene.queryparser.flexible.standard.CommonQueryParserConfiguration
>>> provides a setEnablePositionIncrements, but that does not seem
>>> to work to allow for the above desired query behavior that was
>>> possible prior to Lucene 4.4.
>>> 
>>> What is the proper way to ignore stop word gaps?
>>> 
>>> Thank you,
>>> Chris
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>> 
>>> 

