lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Naber <lucenelist2...@danielnaber.de>
Subject Re: Analysis/tokenization of compound words
Date Wed, 20 Sep 2006 06:57:05 GMT
On Tuesday 19 September 2006 22:41, eks dev wrote:

> ahh, another one, when you strip suffix, check if last char on remaining
> "stem" is "s" (magic thing in German), delete it if not the only
> letter.... do not ask why, long unexplained mistery of German language

This is called "Fugenelement" and there are more characters than just the 
"s", although it might be enough to remove the "s" when trying to detect 
compounds. There are also cases where characters are removed (Wolle + 
Decke => Wolldecke).

Also see http://de.wikipedia.org/wiki/Fugenelement and 
http://en.wikipedia.org/wiki/Epenthesis (which emphasises pronunciation, 
but that's not a good explanation for the existence of these characters in 
German).

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message