lucene-java-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: Lucene - Search breadth approach
Date Wed, 22 Jul 2009 13:16:01 GMT
Hi Robert,

What you could do is use the Stemmer (as a TokenFilter I assume) and produce
two tokens always - the stem and the original. Index both of them in the
same position.
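
To make the "same position" idea concrete, here is a minimal sketch in plain
Java. It does not use Lucene's TokenFilter API; it only models each token as a
term plus a position increment, where an increment of 0 stacks the token on
the previous one. The class name, the Tok type and the toyStem helper are all
made up for illustration, and toyStem is a crude stand-in for a real stemmer
such as Porter or KStem.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the idea; none of these types are Lucene classes.
final class StemAndOriginalSketch {

    // A term plus its position increment: 1 starts a new position,
    // 0 places the term at the same position as the previous token.
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    // Hypothetical toy stemmer (assumption): strips a trailing "ing" or "s".
    static String toyStem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Emits the original word, then its stem at the same position.
    static List<Tok> expand(String text) {
        List<Tok> out = new ArrayList<>();
        for (String w : text.toLowerCase().trim().split("\\s+")) {
            if (w.isEmpty()) {
                continue;
            }
            out.add(new Tok(w, 1));
            String stem = toyStem(w);
            // This sketch skips the stem when it equals the original word.
            if (!stem.equals(w)) {
                out.add(new Tok(stem, 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("Testing the parsers"));
    }
}
```

In a real filter the same effect would come from setting the position
increment of the second token to 0. Skipping the stem when it is identical to
the original is a choice made in this sketch to avoid a duplicate term; it is
not part of the suggestion above.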

Then tell your users that if they search for [testing], it will find results
for 'testing', 'test' etc (the stems) and if they search for ["testing"] it
will do an "exact" search for the word 'testing'.

If you choose to go this way, you'll need to override QueryParser so that
when it encounters a phrase, it does not run the phrase through the stemming
Analyzer but through a different one, maybe WhitespaceAnalyzer, so that it
produces the word 'testing'.
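
The routing decision described above can be sketched as follows, again in
plain Java rather than as a real QueryParser subclass. The class name, the
termsFor method and the toyStem helper are hypothetical. The sketch also
assumes the index was lowercased, so it lowercases quoted terms too, which
WhitespaceAnalyzer itself would not do.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of "phrase means exact, bare term means stemmed".
final class PhraseAwareQuerySketch {

    // Hypothetical toy stemmer (assumption), standing in for Porter or KStem.
    static String toyStem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Returns the terms to look up in the index for a user query.
    static List<String> termsFor(String userQuery) {
        String q = userQuery.trim();
        boolean quoted = q.length() >= 2 && q.startsWith("\"") && q.endsWith("\"");
        String body = quoted ? q.substring(1, q.length() - 1) : q;

        List<String> terms = new ArrayList<>();
        for (String w : body.trim().toLowerCase().split("\\s+")) {
            if (w.isEmpty()) {
                continue;
            }
            // Quoted input keeps the words as typed; bare input is stemmed.
            terms.add(quoted ? w : toyStem(w));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(termsFor("testing"));      // stemmed lookup
        System.out.println(termsFor("\"testing\""));  // exact lookup
    }
}
```

Because both the stem and the original sit in the same field at the same
position, either kind of term can be looked up in the one field without
changing the index.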

With that approach BTW, you can search on stopwords that are part of the
phrase, given of course that you also indexed them.

I'm not aware of any class in Lucene that will allow you to produce two
tokens, except maybe TeeTokenFilter. I wrote my own to do that and it's really
not a big thing to do.

Hope this helps,

Shai

On Wed, Jul 22, 2009 at 1:09 PM, Robert Corbett <java.rab@gmail.com> wrote:

> Hello,
>
> I would like to use a stemming analyser similar to KStem or PorterStem to
> provide access to a wider search scope for our users. However, at the same
> time I also want to provide the ability for the users to throw out the stems
> if they want to search more accurately. I have a number of ideas as to the
> best way to implement this.
>
> I can control the breadth of the search scope with a checkbox on the UI.
> When the scope is wide, I will use the stems; when it's narrow (or exact),
> I'll avoid using the stems.
>
> The approach I envisage is to index the fields twice: once using the
> StandardAnalyser and a second time using the Stemmer. I'll attach a suffix
> to the name of the stemmed set in the index. So for example, TITLE (contains
> only StandardAnalyser output) and TITLE_STEM (contains the StemAnalyser
> output). When I come to generate the query object, I will first check the
> search breadth on the UI. If it's wide, I'll use the TITLE_STEM column,
> parsing the query with the StemAnalyser; otherwise I'll use the TITLE column
> with the query being parsed with the StandardAnalyser.
>
> Although I appreciate it will result in a much larger index and longer
> indexing time, this approach will allow me to implement the required
> functionality.
>
> I just wanted to check with you guys that there is no better, perhaps more
> efficient way of achieving my goals before taking the above approach. All
> feedback / advice will be warmly received.
>
> Thanks guys!
>
