lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Lucene - Search breadth approach
Date Wed, 22 Jul 2009 20:02:05 GMT
A closely related approach to what Shai outlined is to index the *original*
token with a special end marker (say $) and a position increment of 0 (see
SynonymAnalyzer in LIA). Then, whenever you decide you want the un-stemmed
version, just add the marker to the query term (e.g. testing$ when you don't
want to match on the stemmed "test"). I'm pretty sure that the presence of
the $ will short-circuit stemming, but you'll have to make sure that whatever
analyzer you use doesn't strip it.
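
For illustration, here is a rough plain-Java sketch of the marker idea. It
deliberately uses no Lucene classes: Tok, stem, analyze and queryTerm are
made-up names, and the toy stemmer only strips a trailing "ing". In a real
setup this logic would presumably live in a custom TokenFilter, along the
lines of SynonymAnalyzer in LIA.

```java
import java.util.ArrayList;
import java.util.List;

class MarkerTokenSketch {
    /** Hypothetical stand-in for a token: term text plus position increment. */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /** Toy stemmer (assumption, illustration only): strips a trailing "ing". */
    static String stem(String word) {
        return (word.length() > 5 && word.endsWith("ing"))
                ? word.substring(0, word.length() - 3)
                : word;
    }

    /** Index side: emit the stem, then the original plus "$" at the same position. */
    static List<Tok> analyze(String text) {
        List<Tok> out = new ArrayList<>();
        for (String w : text.toLowerCase().split("\\s+")) {
            if (w.isEmpty()) continue;
            out.add(new Tok(stem(w), 1));   // stem advances the position
            out.add(new Tok(w + "$", 0));   // marked original stacks on top of it
        }
        return out;
    }

    /** Query side: append the marker for an exact search, otherwise stem. */
    static String queryTerm(String word, boolean exact) {
        String w = word.toLowerCase();
        return exact ? w + "$" : stem(w);
    }

    public static void main(String[] args) {
        for (Tok t : analyze("Testing stemmers")) {
            System.out.println(t.term + " (+" + t.posInc + ")");
        }
    }
}
```

With this, analyze("Testing stemmers") yields test (+1), testing$ (+0),
stemmers (+1), stemmers$ (+0), so a query for testing$ matches only the
literal word, while test still matches both forms.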

Best
Erick

On Wed, Jul 22, 2009 at 9:16 AM, Shai Erera <serera@gmail.com> wrote:

> Hi Robert,
>
> What you could do is use the Stemmer (as a TokenFilter I assume) and
> always produce two tokens - the stem and the original. Index both of them
> in the same position.
>
> Then tell your users that if they search for [testing], it will find
> results for 'testing', 'test' etc (the stems) and if they search for
> ["testing"] it will do an "exact" search for the word 'testing'.
>
> If you choose to go this way, you'll need to override QueryParser and when
> you encounter a phrase, don't run it by the Analyzer you use which stems,
> but by a different Analyzer, maybe WhitespaceAnalyzer, so that it will
> produce the word 'testing'.
>
> With that approach BTW, you can search on stopwords that are part of the
> phrase, given of course that you also indexed them.
>
> I'm not aware of any class in Lucene that will allow you to produce two
> tokens, except maybe TeeTokenFilter. I wrote my own to do that and it's
> really not a big thing to do.
>
> Hope this helps,
>
> Shai
>
> On Wed, Jul 22, 2009 at 1:09 PM, Robert Corbett <java.rab@gmail.com> wrote:
>
> > Hello,
> >
> > I would like to use a stemming analyser similar to KStem or PorterStem to
> > provide access to a wider search scope for our users. However, at the same
> > time I also want to provide the ability for the users to throw out the
> > stems if they want to search more accurately. I have a number of ideas as
> > to the best way to implement this.
> >
> > I can control the breadth of the search scope with a checkbox on the UI.
> > When the scope is wide, I will use the stems; when it's narrow (or exact)
> > I'll avoid using the stems.
> >
> > The approach I envisage is to index the fields twice: once using the
> > StandardAnalyser and a second time using the Stemmer. I'll attach a suffix
> > to the name of the stemmed set in the index. So for example, TITLE
> > (contains only StandardAnalyser output) and TITLE_STEM (contains the
> > StemAnalyser output). When I come to generate the query object, I will
> > first check the search breadth on the UI. If it's wide, I'll use the
> > TITLE_STEM column, parsing the query with the StemAnalyser; otherwise I'll
> > use the TITLE column with the query being parsed with the StandardAnalyser.
> >
> > Although I appreciate it will result in a much larger index and longer
> > indexing time, this approach will allow me to implement the required
> > functionality.
> >
> > I just wanted to check with you guys that there is no better, perhaps more
> > efficient way of achieving my goals before taking the above approach. All
> > feedback / advice will be warmly received.
> >
> > Thanks guys!
> >
>
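
The two-field scheme from the original question can also be sketched in plain
Java, again without any Lucene classes. Everything here is hypothetical: the
class and method names, the in-memory postings map, and the toy stemmer that
only strips a trailing "ing" or "s". The field names TITLE and TITLE_STEM are
taken from the question. The point it tries to show is that the field and the
analysis applied to the query have to be chosen together.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TwoFieldSketch {
    // field name -> term -> ids of documents containing that term
    private final Map<String, Map<String, Set<Integer>>> index = new HashMap<>();

    /** Toy stemmer (assumption, illustration only). */
    static String stem(String w) {
        if (w.length() > 5 && w.endsWith("ing")) return w.substring(0, w.length() - 3);
        if (w.length() > 3 && w.endsWith("s")) return w.substring(0, w.length() - 1);
        return w;
    }

    /** Lowercase, split on non-word characters, optionally stem each term. */
    static List<String> tokenize(String text, boolean stemmed) {
        List<String> terms = new ArrayList<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) terms.add(stemmed ? stem(w) : w);
        }
        return terms;
    }

    /** Index the same title twice: unstemmed into TITLE, stemmed into TITLE_STEM. */
    public void addTitle(int docId, String title) {
        post("TITLE", docId, tokenize(title, false));
        post("TITLE_STEM", docId, tokenize(title, true));
    }

    private void post(String field, int docId, List<String> terms) {
        Map<String, Set<Integer>> postings =
                index.computeIfAbsent(field, f -> new HashMap<>());
        for (String t : terms) {
            postings.computeIfAbsent(t, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Single-word search: the breadth flag selects both the field and the analysis. */
    public Set<Integer> search(String word, boolean wide) {
        String field = wide ? "TITLE_STEM" : "TITLE";
        List<String> terms = tokenize(word, wide);
        if (terms.isEmpty()) return Collections.emptySet();
        Map<String, Set<Integer>> postings =
                index.getOrDefault(field, Collections.emptyMap());
        return postings.getOrDefault(terms.get(0), Collections.emptySet());
    }
}
```

With titles "Testing Lucene" (doc 1) and "Unit tests" (doc 2), a wide search
for tests returns both documents, while an exact search returns only doc 2.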
