lucene-java-user mailing list archives

From "Yonik Seeley" <ysee...@gmail.com>
Subject Re: Indexing Dash concatenated words vs SynonymAnalyzer
Date Tue, 20 Jun 2006 19:42:51 GMT
On 6/20/06, Martin Braun <mbraun@uni-hd.de> wrote:
> German words are often dash-concatenated, e.g. West-Berlin or something
> like "C*-algebras and W*-algebras".
>
> I'm inclined to write my own analyzer, like the SynonymAnalyzer from the
> LIA book. I want to index these words like this:
>
> West-Berlin => Westberlin | West | Berlin | "West Berlin"
> C*-algebras => c | algebra | calgebra

Hi Martin,

Solr's WordDelimiterFilter can index West-Berlin as "West" | "Berlin"
| "WestBerlin".
It currently indexes "Berlin" one position after "West", and "WestBerlin"
at the same position as "Berlin", so phrase matches like "West Berlin"
will still work.
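
To make the position layout concrete, here is a rough model in plain Python. It is not Solr's code and uses none of its API: the names analyze and phrase_match are made up for this illustration, and the splitting rule (break on "-" and at a lower-to-upper case change, ASCII letters only) is a simplifying assumption.

```python
import re

def analyze(text):
    """Return (term, position) pairs for whitespace-separated words.

    Assumption: a word is split into runs of ASCII letters, breaking on
    non-letters such as "-" and at a lower-to-upper case change.
    """
    tokens = []
    pos = -1
    for word in text.split():
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+", word)
        for part in parts:
            pos += 1
            tokens.append((part, pos))
        if len(parts) > 1:
            # The concatenated form shares the position of the last part.
            tokens.append(("".join(parts), pos))
    return tokens

def phrase_match(tokens, terms):
    """True if the terms occur at consecutive positions, in order."""
    positions = {}
    for term, pos in tokens:
        positions.setdefault(term, set()).add(pos)
    return any(
        all(start + i in positions.get(term, ()) for i, term in enumerate(terms))
        for start in positions.get(terms[0], ())
    )

tokens = analyze("West-Berlin")
print(tokens)
print(phrase_match(tokens, ["West", "Berlin"]))
```

In this model "West" sits at position 0 while "Berlin" and "WestBerlin" share position 1, so the phrase "West Berlin" and the single term "WestBerlin" both match.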

It does this automatically, even if there isn't a "-" between the
words, so "WestBerlin" in the document would be indexed the same as
"West-Berlin" by default.


> The difference from the SynonymAnalyzer is that one word will be
> separated into two words. So it is not a synonym like quick <=>
> fast, but something like quick <=> "lightning fast".
>
> Is it possible to get two words as a synonym at the same increment
> position during indexing? What will happen with a phrase search?

I don't know if you would still need it after the WordDelimiterFilter,
but Solr's SynonymFilter handles multi-token synonyms and multiple
synonyms at the same position.
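
As an illustration of what "multi-token synonyms at the same position" can look like, here is a small plain-Python sketch. It is one possible layout under stated assumptions, not a description of how Solr's SynonymFilter is implemented: the function name expand and the synonym table are hypothetical, and the rule that each extra word of a synonym takes the next position is an assumption of the sketch.

```python
def expand(terms, synonyms):
    """Return (term, position) pairs with synonyms overlaid on the originals."""
    tokens = []
    for pos, term in enumerate(terms):
        tokens.append((term, pos))
        for alternative in synonyms.get(term, []):
            # Assumption: the first word of a synonym shares the original
            # term's position; each further word takes the next position.
            for offset, word in enumerate(alternative.split()):
                tokens.append((word, pos + offset))
    return tokens

# Hypothetical synonym table, using the example from the question.
SYNONYMS = {"quick": ["fast", "lightning fast"]}

for term, pos in expand(["quick", "fox"], SYNONYMS):
    print(pos, term)
```

With this layout "quick", "fast" and "lightning" all sit at position 0, and the second "fast" shares position 1 with "fox". A phrase query for "lightning fast" or "quick fox" finds consecutive positions, but so would "lightning fox", which is the price of overlaying a two-word synonym on a one-word original in this simple model.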

It's all open source, so you can pull them out and use them for your
own purposes, or try using Solr itself :-)

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


