lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Searching on plurals and phrases in a single field
Date Wed, 12 Dec 2007 19:34:00 GMT
I faced a very similar requirement and solved it by indexing multiple
tokens at the same place. For instance, say you're indexing
the word "foxes". Index something like fox$ and foxes at the same
position (see SynonymAnalyzer in Lucene In Action for an example).
You probably MUST index the multiple terms with an increment gap
of 0 (more later).

Example phrase "red foxes are plentiful"

Now you have the capability of distinguishing between a stemmed
and unstemmed version of the word and can search for exactly
"red foxes". But if instead you want to search for the stemmed
version, you can search for "red fox$". But "red fox" will NOT match.

The reason you need to index these with an increment gap of 0 is so
phrase queries work. If you let the gap increment for each token, and
indexed a phrase like "red foxes are plentiful", then did a
proximity search on "red plentiful"~2, it would fail because
you'd have fox$ and foxes each taking up one position. But if
fox$ and foxes both have the same position, it'll work.

And it's all in the same index, one field, etc.

Hope this helps
Erick

On Dec 12, 2007 1:25 PM, Lucifer Hammer <lucener@gmail.com> wrote:

> Hi,
>
> We've got a requirement that we need to give our users  the ability to
> search on exact phrases within a field, or, if they prefer, they can match
> on plurals(either via stems, or another plural algorithm).  However, the
> cases are mutually exclusive, for example given the following field in the
> index:
>
> IndexField1: "The quick brown dog jumped over the lazy fox"
>
> If the user chooses an exact phrase search such as "lazy dog jumped", then
> it'll match, however, if they also choose an exact phrase  and search for:
> "lazy dogs" it shouldn't match.
>
> If the user chooses a plural search, then  both of the above searches
> should
> match.
>
> So... the question really is:  Can I do this all in one field, or will I
> have to index the data twice, once in a field that has the exact text, and
> in a second field in which I index the terms, and wordstack stems(or
> plurals).
>
> If it's possible to do this in a single field, that would be much
> preferred...
>
> Thanks for any help!
> Lucifer
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message