lucene-java-user mailing list archives

From "Jason Polites" <jason.poli...@gmail.com>
Subject Re: Stop words in index
Date Mon, 04 Sep 2006 06:28:38 GMT
Hey,

Just a quick addendum to this original issue.

My first need was to ensure that stop words were not stored in the index
(which your helpful suggestion led me to confirm); however, this has raised a
second, scarier issue.

It seems that because stop words are excluded from the index, quoted
sequences of tokens which include stop words will yield no results.

That is,

In the default StandardAnalyzer, the stop word list contains the word "on".
If I have a document which contains the phrase "Disney on Ice", the index
will show only "Disney" and "Ice", but not "on".

This is fine, and if the user searches for:

Disney on Ice

They will get a match.  But, it seems that a search for:

"Disney on Ice"

With the quotation marks indicating the desire for an "exact match", the
absence of stop words in the index means this yields zero results.

Am I going crazy here?
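
A toy model of the behaviour described above, in plain Java. This is not
Lucene's API: the class, the method names and the three-word stop list are all
made up for illustration. It assumes the zero-result case arises when the
quoted phrase is not passed through the same stop-word filtering as the indexed
text; whether that is what actually happens in this setup is unconfirmed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordPhraseDemo {
    // Hypothetical stop list for illustration; not Lucene's actual default list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("on", "the", "a"));

    // Lower-case the text and split it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Tokenize, then drop stop words (a stand-in for what an analyzer with a
    // stop filter does at index time).
    static List<String> analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String t : tokenize(text)) {
            if (!STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // A phrase matches if its terms appear consecutively in the indexed terms.
    static boolean phraseMatches(List<String> indexed, List<String> phrase) {
        return !phrase.isEmpty() && Collections.indexOfSubList(indexed, phrase) >= 0;
    }

    public static void main(String[] args) {
        List<String> indexed = analyze("Disney on Ice");
        System.out.println(indexed); // [disney, ice]

        // Phrase left unfiltered: "on" is looked for but was never indexed.
        System.out.println(phraseMatches(indexed, tokenize("Disney on Ice"))); // false

        // Phrase filtered the same way as the document: the terms line up.
        System.out.println(phraseMatches(indexed, analyze("Disney on Ice"))); // true
    }
}
```

In this model the phrase only fails when the query side keeps "on" while the
index side drops it, so one thing worth checking is whether the same analyzer
is applied to the quoted query as to the documents.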

On 9/3/06, Jason Polites <jason.polites@gmail.com> wrote:
>
> Roger that.  I'll double check my code.
>
> Thanks.
>
>
> On 9/3/06, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
> >
> > They shouldn't be in the index.  You must be using StandardAnalyzer
> > incorrectly, or maybe you think you are using it, but are really using
> > something else.
> >
> > Otis
> >
> > ----- Original Message ----
> > From: Jason Polites <jason.polites@gmail.com>
> > To: java-user@lucene.apache.org
> > Sent: Saturday, September 2, 2006 9:05:27 AM
> > Subject: Stop words in index
> >
> > Hey all,
> >
> > I am using the StandardAnalyzer with my own list of stop words (which is
> > more comprehensive than the default list), and my expectation was that
> > this
> > would omit these stop words from the index when data is indexed using
> > this
> > analyzer.  However, I am seeing stop words in the term vector for
> > documents
> > indexed with this analyzer.
> >
> > Is this expected behaviour?  Is there any way I can force these stop
> > words
> > to be omitted from the index?  Having them in the index is wreaking
> > havoc
> > with term vector analysis to determine document similarity.
> >
> > Thanks.
> >
>
