lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: How to include a multi-word synonym to a word when indexing?
Date Mon, 11 Apr 2005 16:16:12 GMT
On Apr 11, 2005, at 9:36 AM, Peter Hotm. Nørregaard wrote:
> According to "Lucene in Action" it is possible to get synonyms indexed 
> together with a word by putting multiple words with the same 
> position-id in the term vector.
>
> My problem is, however, that some words need to have alternatives 
> where the word is decomposed / decompounded into two or more words:
>
> "FooBar Corp" or "cybercafe"
>
> should be found when searching for
>
> "Foo Ba*" or "cyber cafe"

First, the phrase query with wildcards is not currently a built-in 
capability of Lucene.

You'll need some kind of lookup to know how to split a token like 
"cybercafe" into two words - once you've done that it will be easy to 
set the position increment of them to zero so that they overlay the 
original term.
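
A minimal sketch of that idea, in plain Java rather than against the 
Lucene API. The class `DecompoundSketch`, the `Tok` type, and the lookup 
map are all hypothetical names made up for illustration; a real 
implementation would live in a token filter and set the increment on 
each emitted token. Terms are assumed to be lowercased already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative model of a token stream; none of this is Lucene API.
final class DecompoundSketch {
    /** A term plus its position increment relative to the previous token. */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
        @Override public String toString() { return term + "/" + posInc; }
    }

    /**
     * Emits every input term with increment 1. If the lookup knows how to
     * split a term, the parts follow with increment 0, so they share the
     * original term's position.
     */
    static List<Tok> expand(List<String> terms, Map<String, List<String>> lookup) {
        List<Tok> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Tok(term, 1));
            List<String> parts = lookup.get(term);
            if (parts != null) {
                for (String part : parts) {
                    out.add(new Tok(part, 0));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> lookup =
            Collections.singletonMap("cybercafe", Arrays.asList("cyber", "cafe"));
        System.out.println(expand(Arrays.asList("the", "cybercafe"), lookup));
    }
}
```

With every part at increment 0, "cyber" and "cafe" sit at the same 
position, so an exact phrase query for "cyber cafe" (which expects 
consecutive positions) may still need some slop. Giving the later parts 
an increment of 1 is a possible variation, at the cost of shifting the 
positions of the tokens that follow.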

> The reverse is also true: The "Foo Bar Corp" should be found with 
> "Foob* corp".

Again the caveat about wildcard phrase queries applies.  You'll need to 
use that same lookup mechanism to combine multiple tokens coming 
through a token filter into a single combined one.
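
The reverse direction could be sketched the same way, again in plain 
Java and not against the Lucene API. `CompoundSketch` and its lookup map 
(keyed on two adjacent terms joined by a space) are hypothetical, and 
overlaying the combined token on the first word's position is an 
assumption of this sketch, not something Lucene prescribes. Each output 
entry is written as `term/increment`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only; none of this is Lucene API.
final class CompoundSketch {
    /**
     * Emits every input term with increment 1. When a term and its
     * successor form a known pair, the combined term is emitted right
     * after the first word with increment 0, so it shares that position.
     */
    static List<String> combine(List<String> terms, Map<String, String> lookup) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            out.add(terms.get(i) + "/1");
            if (i + 1 < terms.size()) {
                String combined = lookup.get(terms.get(i) + " " + terms.get(i + 1));
                if (combined != null) {
                    out.add(combined + "/0");
                }
            }
        }
        return out;
    }
}
```

For the terms "foo", "bar", "corp" and a lookup mapping "foo bar" to 
"foobar", this yields foo/1, foobar/0, bar/1, corp/1, so a prefix term 
such as "foob*" has an indexed term to match.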


> So how do I solve this problem?

The approach I've described above is a crude dictionary-based one. I'm 
sure other tricks could be employed, but they would be quite 
sophisticated (N-gram sequencing à la LingPipe comes to mind).

	Erik

