lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: Lucene - Search breadth approach
Date Wed, 22 Jul 2009 20:40:41 GMT
But as far as I know, it doesn't index the original termtoo (at the same
offset), which you have to do if you
want to distinguish between the two cases, I think.

But I confess I've been out of the guts of Lucene for some
time, so I could be waaaaay off.

But you'd sure want to use a different token <G>....

Erick


On Wed, Jul 22, 2009 at 4:12 PM, Shai Erera <serera@gmail.com> wrote:

> Actually my stemming Analyzer adds a similar character to stems, to
> distinguish between original tokens (like orig=test) to stems (testing -->
> test$).
>
> On Wed, Jul 22, 2009 at 11:02 PM, Erick Erickson <erickerickson@gmail.com
> >wrote:
>
> > A closely related approach to what Shai outlined is to index the
> > *original*token
> > with a special ender (say $) with a 0 increment (see SynonymAnalyzer
> > in LIA). Then, whenever you determined you wanted to use the un-stemmed
> > version, just add your token to the terms (i.e. testing$ when you didn't
> > want
> > the analyzer to match on the stemmed "test"). I'm pretty sure that the
> > presence
> > of the $ will short-circuit stemming, but you'll have to be sure that
> > whatever
> > analyzer you use doesn't strip it.
> >
> > Best
> > Erick
> >
> > On Wed, Jul 22, 2009 at 9:16 AM, Shai Erera <serera@gmail.com> wrote:
> >
> > > Hi Robert,
> > >
> > > What you could do is use the Stemmer (as a TokenFilter I assume) and
> > > produce
> > > two tokens always - the stem and the original. Index both of them in
> the
> > > same position.
> > >
> > > Then tell your users that if they search for [testing], it will find
> > > results
> > > for 'testing', 'test' etc (the stems) and if they search for
> ["testing"]
> > it
> > > will do an "exact" search for the word 'testing'.
> > >
> > > If you choose to go this way, you'll need to override QueryParser and
> > when
> > > you encounter a phrase, don't run it by the Analyzer you use which
> stems,
> > > but by a different Analyzer, maybe WhitespaceAnalyzer, so that it will
> > > produce the word 'testing'.
> > >
> > > With that approach BTW, you can search on stopwords that are part of
> the
> > > phrase, given of course that you also indexed them.
> > >
> > > I'm not aware of any class in Lucene that will allow you to produce two
> > > tokens, except may TeeTokenFilter. I wrote my own to do that and it's
> > > really
> > > not a big thing to do.
> > >
> > > Hope this helps,
> > >
> > > Shai
> > >
> > > On Wed, Jul 22, 2009 at 1:09 PM, Robert Corbett <java.rab@gmail.com>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I would like to use a stemming analyser similar to KStem or
> PorterStem
> > to
> > > > provide access to a wider search scope for our users. However, at the
> > > same
> > > > time I also want to provide the ability for the users to throw out
> the
> > > > stems
> > > > if they want to search more accurately. I have a number of ideas as
> to
> > > the
> > > > best way to implement this.
> > > >
> > > > I can control the breadth of the search scope with a checkbox on the
> > ui.
> > > > When the scope is wide, I will use the stems, when its narrow (or
> > exact)
> > > > I'll avoid using the stems.
> > > >
> > > > The approach I envisage is to index the fields twice. Once using the
> > > > StandardAnalyser and a second time using the Stemmer. I'll attach a
> > > suffix
> > > > to the name of the stemmed set in the index. So for example, TITLE
> > > > (contains
> > > > only StandardAnalyser output) and TITLE_STEM (contains the
> StemAnalyser
> > > > output). When I come to generate the query object, I will first check
> > the
> > > > search breath on the UI. If its wide, I'll use the TITLE_STEM column
> > > > parsing
> > > > the query with the StemAnalyser, otherwise I'll use the TITLE column
> > with
> > > > the query being parsed with the StandardAnalyser.
> > > >
> > > > Although I appreciate it will result in a much larger index and
> longer
> > > > indexing time, this approach will allow me to implement the required
> > > > functionality.
> > > >
> > > > I just wanted to check with you guys that there is no better, perhaps
> > > more
> > > > efficient way of achieving my goals before taking the above approach.
> > All
> > > > feedback / advice will be warmly received.
> > > >
> > > > Thanks guys!
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message