lucene-java-user mailing list archives

From "Mark Ferguson" <>
Subject Re: Searching any part of a string
Date Fri, 27 Jun 2008 15:57:31 GMT
Hi Erick,

Thanks for the suggestions. I've used indexed n-grams before to implement
spell-checking; I think in this case I may take a look at WildcardTermEnum
and RegexTermEnum. It seems like a good solution because I am doing my own
results ordering so Lucene's scoring is irrelevant in this case. I wasn't
aware of these classes so thanks for mentioning them!
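For anyone following the thread, here is a minimal sketch of the idea behind enumerating terms: walk the sorted term dictionary and keep every term that contains the fragment. It is plain Java and does not call WildcardTermEnum or RegexTermEnum; the class TermScan and the method termsContaining are hypothetical names, and a TreeSet stands in for the index's term dictionary. It assumes usernames were lowercased at index time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Plain-Java stand-in for walking a term dictionary; no Lucene classes are used.
class TermScan {
    // Returns every term containing the fragment, in dictionary (sorted) order.
    // Assumption: terms were lowercased when they were indexed.
    static List<String> termsContaining(SortedSet<String> terms, String fragment) {
        String needle = fragment.toLowerCase();
        List<String> hits = new ArrayList<>();
        for (String term : terms) {          // visits every term once
            if (term.contains(needle)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SortedSet<String> terms =
                new TreeSet<>(Arrays.asList("mark", "smith", "smitty", "erick"));
        System.out.println(termsContaining(terms, "mit")); // prints [smith, smitty]
    }
}
```

The scan touches every term, so its cost grows with the number of distinct usernames rather than with the number of matches. The matching terms come back unscored, which fits a setup where ordering is done outside Lucene.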



On Wed, Jun 25, 2008 at 12:25 PM, Erick Erickson <> wrote:

> Warning: I don't understand ngrams at all, so you should
> read this as a plea for those who do to tell me I'm off base <G>.
> But I wonder if indexing as n-grams would be a way to
> cope with this issue that lots of people have. *assuming*
> you are thinking about single terms, then it seems that
> "smith" would be tokenized as sm, mi, it, th. Then
> a wildcard search for "mi it" would hit (as a phrase
> query or a SpanQuery with slop of 0). It seems like there
> are several issues to work out here, especially including
> multiple terms, matching mixtures of wildcards and
> non-wildcards, etc.
> But it seems do-able....
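A rough sketch of the bigram idea described above, in plain Java rather than Lucene's analysis or query classes. The class BigramSearch and its methods are hypothetical names. The contiguous-sublist check plays the role of a phrase match with slop 0: the query's bigrams must appear in the same order at adjacent positions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: models bigram tokenization and an exact phrase match in memory.
class BigramSearch {
    // "smith" -> [sm, mi, it, th]
    static List<String> bigrams(String s) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // True when the query's bigrams occur consecutively in the username's bigrams.
    static boolean matches(String username, String query) {
        List<String> q = bigrams(query);
        if (q.isEmpty()) {
            return false; // queries shorter than two characters yield no bigrams
        }
        return Collections.indexOfSubList(bigrams(username), q) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("smith"));        // prints [sm, mi, it, th]
        System.out.println(matches("smith", "mit")); // prints true
        System.out.println(matches("mixit", "mit")); // prints false
    }
}
```

The "mixit" case shows why positions matter: it contains both "mi" and "it", but not next to each other, so it must not match "mit". Single-character queries are one of the issues left to work out, since they produce no bigrams at all.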
> Another approach is to use WildcardTermEnum and/or
> RegexTermEnum to build up a filter and use the filter as
> part of the query. What you lose with this approach is
> that the filter (and wildcards) then don't contribute to
> scoring. But this isn't a huge price to pay...
> Best
> Erick
> On Wed, Jun 25, 2008 at 1:47 PM, Mark Ferguson <>
> wrote:
> > Hello,
> >
> > I am currently keeping an index of all our client's usernames. The search
> > functionality is implemented using a PrefixFilter. However, we would like to
> > expand the functionality to be able to search any part of a user's name,
> > rather than requiring that it begin with the query string. So for example,
> > the search term 'mit' would return the username 'smith'.
> >
> > I am hesitant to use a WildcardQuery starting with an asterisk because I've
> > read about why this is a bad idea. I am looking for suggestions on the best
> > way to implement this.
> >
> > The idea I've come up with is to index each part of the username; so for
> > example, if the username is 'mark', you would index mark, ark, rk, and k.
> > Then you could still use the PrefixFilter. I'm not overly concerned about
> > how this would enlarge the index because usernames tend to be fairly short.
> >
> > I am very much open to other suggestions however. Does anyone have any
> > opinions or ideas that they can share?
> >
> > Thanks very much.
> >
> > Mark
> >
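
The suffix-indexing idea from the original message (index mark, ark, rk, and k, then search by prefix) can be sketched in plain Java as below. This is an in-memory model under stated assumptions, not Lucene code: SuffixIndex, add, and search are hypothetical names, and a TreeMap range lookup stands in for the PrefixFilter.

```java
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: every suffix of a username is stored as a key, so a
// substring search becomes a prefix search over the sorted keys.
class SuffixIndex {
    private final TreeMap<String, Set<String>> suffixToNames = new TreeMap<>();

    // Indexes "mark" under mark, ark, rk, and k.
    void add(String username) {
        for (int i = 0; i < username.length(); i++) {
            suffixToNames
                    .computeIfAbsent(username.substring(i), k -> new TreeSet<>())
                    .add(username);
        }
    }

    // Returns usernames having at least one suffix that starts with the fragment.
    Set<String> search(String fragment) {
        Set<String> hits = new TreeSet<>();
        if (fragment.isEmpty()) {
            return hits;
        }
        // Keys in [fragment, fragment + '\uffff') share the fragment as a prefix;
        // a key whose next character is itself '\uffff' would be missed.
        String upper = fragment + Character.MAX_VALUE;
        for (Set<String> names : suffixToNames.subMap(fragment, upper).values()) {
            hits.addAll(names);
        }
        return hits;
    }

    public static void main(String[] args) {
        SuffixIndex index = new SuffixIndex();
        index.add("mark");
        index.add("smith");
        System.out.println(index.search("mit")); // prints [smith]
    }
}
```

A username of length n adds n keys, which is the index growth the original message mentions; for short usernames that stays small.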
