lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shai Erera <ser...@gmail.com>
Subject Re: Lucene - Search breadth approach
Date Wed, 22 Jul 2009 20:12:46 GMT
Actually my stemming Analyzer adds a similar character to stems, to
distinguish between original tokens (like orig=test) to stems (testing -->
test$).

On Wed, Jul 22, 2009 at 11:02 PM, Erick Erickson <erickerickson@gmail.com>wrote:

> A closely related approach to what Shai outlined is to index the
> *original*token
> with a special ender (say $) with a 0 increment (see SynonymAnalyzer
> in LIA). Then, whenever you determined you wanted to use the un-stemmed
> version, just add your token to the terms (i.e. testing$ when you didn't
> want
> the analyzer to match on the stemmed "test"). I'm pretty sure that the
> presence
> of the $ will short-circuit stemming, but you'll have to be sure that
> whatever
> analyzer you use doesn't strip it.
>
> Best
> Erick
>
> On Wed, Jul 22, 2009 at 9:16 AM, Shai Erera <serera@gmail.com> wrote:
>
> > Hi Robert,
> >
> > What you could do is use the Stemmer (as a TokenFilter I assume) and
> > produce
> > two tokens always - the stem and the original. Index both of them in the
> > same position.
> >
> > Then tell your users that if they search for [testing], it will find
> > results
> > for 'testing', 'test' etc (the stems) and if they search for ["testing"]
> it
> > will do an "exact" search for the word 'testing'.
> >
> > If you choose to go this way, you'll need to override QueryParser and
> when
> > you encounter a phrase, don't run it by the Analyzer you use which stems,
> > but by a different Analyzer, maybe WhitespaceAnalyzer, so that it will
> > produce the word 'testing'.
> >
> > With that approach BTW, you can search on stopwords that are part of the
> > phrase, given of course that you also indexed them.
> >
> > I'm not aware of any class in Lucene that will allow you to produce two
> > tokens, except may TeeTokenFilter. I wrote my own to do that and it's
> > really
> > not a big thing to do.
> >
> > Hope this helps,
> >
> > Shai
> >
> > On Wed, Jul 22, 2009 at 1:09 PM, Robert Corbett <java.rab@gmail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > I would like to use a stemming analyser similar to KStem or PorterStem
> to
> > > provide access to a wider search scope for our users. However, at the
> > same
> > > time I also want to provide the ability for the users to throw out the
> > > stems
> > > if they want to search more accurately. I have a number of ideas as to
> > the
> > > best way to implement this.
> > >
> > > I can control the breadth of the search scope with a checkbox on the
> ui.
> > > When the scope is wide, I will use the stems, when its narrow (or
> exact)
> > > I'll avoid using the stems.
> > >
> > > The approach I envisage is to index the fields twice. Once using the
> > > StandardAnalyser and a second time using the Stemmer. I'll attach a
> > suffix
> > > to the name of the stemmed set in the index. So for example, TITLE
> > > (contains
> > > only StandardAnalyser output) and TITLE_STEM (contains the StemAnalyser
> > > output). When I come to generate the query object, I will first check
> the
> > > search breath on the UI. If its wide, I'll use the TITLE_STEM column
> > > parsing
> > > the query with the StemAnalyser, otherwise I'll use the TITLE column
> with
> > > the query being parsed with the StandardAnalyser.
> > >
> > > Although I appreciate it will result in a much larger index and longer
> > > indexing time, this approach will allow me to implement the required
> > > functionality.
> > >
> > > I just wanted to check with you guys that there is no better, perhaps
> > more
> > > efficient way of achieving my goals before taking the above approach.
> All
> > > feedback / advice will be warmly received.
> > >
> > > Thanks guys!
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message