lucene-general mailing list archives

From David Alyea <dal...@gmail.com>
Subject Re: How to wildcard
Date Fri, 16 Nov 2012 16:34:06 GMT
I ended up with this:

<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="6"
side="front"/>

and it works great!  It's important to specify side or the N-gram
buildout is really huge.  My users generally will start typing their
wildcard searches left-anchored, so it was not only overkill to
have all the generated N-grams, but was causing way too many
false positives to hit.

To provide some on-the-fly documentation of the above, if
you have:

sm333k carbon shoes

the tokens generated, given my specs above, are:

sm3 sm33 sm333 sm333k car carb carbo carbon sho shoe shoes

For a word with 6+ characters, it makes the 4 N-grams
of length 3 to 6 starting with the 1st char.  It's like:

for (i=3..6) {
   token=substr(x, 0, i);
}
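
Spelled out as something runnable, here is a rough Python sketch of
the same idea.  It only mimics the effect of the filter settings above;
it is not the Solr implementation.  The function names are made up for
illustration, and the plain whitespace split is an assumption standing
in for whatever tokenizer the field type really uses.

```python
def edge_ngrams(word, min_gram=3, max_gram=6):
    """Front-anchored n-grams of `word`, lengths min_gram..max_gram."""
    # In this sketch, words shorter than min_gram produce no tokens.
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]


def analyze(text, min_gram=3, max_gram=6):
    """Apply edge_ngrams to every whitespace-separated word in `text`."""
    # Whitespace splitting is an assumption, not the real tokenizer.
    tokens = []
    for word in text.split():
        tokens.extend(edge_ngrams(word, min_gram, max_gram))
    return tokens


print(" ".join(analyze("sm333k carbon shoes")))
# sm3 sm33 sm333 sm333k car carb carbo carbon sho shoe shoes
```

A word shorter than 6 characters, like "shoes", simply stops at its
own length, which is why it yields only 3 tokens.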

Thanks for pointing me in this direction!

On Thu, Nov 15, 2012 at 4:59 PM, Upayavira <uv@odoko.co.uk> wrote:

> Remember to distinguish between recall and precision - you're likely to
> get too many results, but what matters is whether the first ones are
> useful.
>
> You could have two versions of your field, one with normal stemming,
> another with n-grams, and boost the normal field above the n-gram one
> to give exact matches a boost above inexact matches.
>
> Upayavira
>
> On Thu, Nov 15, 2012, at 09:48 PM, David Alyea wrote:
> > OK, I tried that.  Had just Snowball and EdgeNGram
> > in both index and query.  When I ran the "sm3 carbon"
> > select, it went from 3,500 matches to 89,000!  So yes,
> > that edge building works!  But too much.  And... the
> > top score matches didn't look at all like "sm3 carbon"
> > products, and the shoes were nowhere in sight.  So,
> > I'll toy with it on a dev instance and see what I see.
> > I definitely like the idea and I can see that N-gram
> > tokens are going to behave like wildcarding.
> >
> > On Thu, Nov 15, 2012 at 4:13 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >
> > > On Thu, Nov 15, 2012 at 9:44 AM, David Alyea <dalyea@gmail.com> wrote:
> > > >
> > > > to index:
> > > > <filter class="solr.PorterStemFilterFactory"/>
> > > > <filter class="solr.KStemFilterFactory"/>
> > > > <filter class="solr.EnglishMinimalStemFilterFactory"/>
> > > >
> > > > to query:
> > > > <filter class="solr.SnowballPorterFilterFactory" language="English"/>
> > > >
> > >
> > > I don't think it's a good idea to use 4 different stemming algorithms
> > > (porter1, kstem, and plural at index-time, plus porter2 at query-time).
> > > This means you are analyzing terms in a totally different way at index
> > > time than you are at query-time.
> > >
> > > Just pick one of them: make your index-time and query-time analysis
> > > the same as a start and I think you will see fewer surprises.
> > >
>
