lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Searching any part of a string
Date Wed, 25 Jun 2008 18:25:25 GMT
Warning: I don't understand ngrams at all, so you should
read this as a plea for those who do to tell me I'm off base <G>.


But I wonder if indexing as n-grams would be a way to
cope with this issue that lots of people have. *assuming*
you are thinking about single terms, then it seems that
"smith" would be tokenized as sm, mi, it, th. Then
a wildcard search for "mi it" would hit (as a phrase
query or a SpanQuery with slop of 0). It seems like there
are several issues to work out here, especially including
multiple terns, matching mixtures of wildcards and
non-wildcards, etc.

But it seems do-able....


Another approach is to use WildcardTernEnum and/or
RegexTermEnum to build up a filter and use the filter as
part of the query. What you loose with this approach is
that the filter (and wildcards) then don't contribute to
scoring. But this isn't a huge price to pay...

Best
Erick

On Wed, Jun 25, 2008 at 1:47 PM, Mark Ferguson <mark.a.ferguson@gmail.com>
wrote:

> Hello,
>
> I am currently keeping an index of all our client's usernames. The search
> functionality is implemented using a PrefixFilter. However, we would like
> to
> expand the functionality to be able to search any part of a user's name,
> rather than requiring that it begin with the query string. So for example,
> the search term 'mit' would return the username 'smith'.
>
> I am hesitant to use a WildcardQuery starting with an asterisk because I've
> read about why this is a bad idea. I am looking for suggestions on the best
> way to implement this.
>
> The idea I've come up with is to index each part of the username; so for
> example, if the username is 'mark', you would index mark, ark, rk, and k.
> Then you could still use the PrefixFilter. I'm not overly concerned about
> how this would enlarge the index because usernames tend to be fairly short.
>
> I am very much open to other suggestions however. Does anyone have any
> opinions or ideas that they can share?
>
> Thanks very much.
>
> Mark
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message