lucene-java-user mailing list archives

From: "Jonathan O'Connor" <jonathan.ocon...@xcom.de>
Subject: Re: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]
Date: Tue, 01 Mar 2005 11:18:55 GMT

Jon,
I too ran into some problems with the German analyzer recently. Here's what 
may help:
1. Try reading Joerg Caumanns' paper "A Fast and Simple Stemming 
Algorithm for German Words". It describes the algorithm 
implemented by GermanAnalyzer.
2. I guess it's because German nouns are all capitalized, so case can carry 
meaning. That only helps, though, if you are indexing well-written German 
and not emails or text messages!
3. The German stemmer converts umlauts into some funny form (the code is a 
bit tricky, and I didn't spend any time looking at it), so maybe that's why 
your umlaut searches come up empty. I think the main reason for this umlaut 
change is that many plurals are formed by umlauting, e.g. Haus, Haeuser 
(that "ae" is an a-umlaut).
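
To make point 3 concrete, here is a minimal Java sketch of how umlaut 
folding at index time could leave a wildcard such as ö* with nothing to 
match. It is an illustration under assumptions, not the GermanStemmer code: 
normalize() and prefixSearch() are hypothetical helpers, the folding rule 
(a-umlaut to a, o-umlaut to o, u-umlaut to u) is a simplification, and it 
assumes the wildcard prefix is compared with the indexed terms without 
going through the analyzer first.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class UmlautPrefixSketch {

    // Hypothetical stand-in for index-time analysis: lowercase the token,
    // then fold each umlaut to its base vowel. The real stemmer does more.
    static String normalize(String token) {
        return token.toLowerCase(Locale.GERMAN)
                .replace('\u00e4', 'a')   // a-umlaut
                .replace('\u00f6', 'o')   // o-umlaut
                .replace('\u00fc', 'u');  // u-umlaut
    }

    // Plain prefix match against the indexed terms, which is roughly what
    // a trailing-wildcard query does with an unanalyzed prefix.
    static List<String> prefixSearch(List<String> indexedTerms, String prefix) {
        return indexedTerms.stream()
                .filter(term -> term.startsWith(prefix))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Terms as they would sit in the index after normalization.
        List<String> indexed = List.of("\u00d6l", "Ofen", "H\u00e4user").stream()
                .map(UmlautPrefixSketch::normalize)
                .collect(Collectors.toList());

        System.out.println(indexed);                                    // [ol, ofen, hauser]
        // Raw prefix with an umlaut: no indexed term starts with it.
        System.out.println(prefixSearch(indexed, "\u00f6"));            // []
        // Same prefix after the same folding: matches again.
        System.out.println(prefixSearch(indexed, normalize("\u00f6"))); // [ol, ofen]
    }
}
```

If this is what is going on, Luke should show folded terms in the index, 
and applying the same folding to the prefix before searching would be one 
possible workaround.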

Finally, to really understand what's happening, get your hands on Luke. I 
just got it last week, and it's brilliant. It shows you everything about 
your indexes. You can also feed text to an Analyzer and see what it makes 
of it. That will show you the real reason why your umlaut search is 
failing.
Ciao,
Jonathan O'Connor
XCOM Dublin



"Jon Humble" <jon.humble@tecsphere.com> 
wrote on 01/03/2005 09:35:
To: <lucene-user@jakarta.apache.org>
Please respond to: "Lucene Users List" <lucene-user@jakarta.apache.org>
Subject: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]

Hello,
 
We're using the GermanAnalyzer/Stemmer to index/search our (German)
website.
I have a few questions:
 
(1)     Why is the GermanAnalyzer case-sensitive? None of the other
language analyzers seem to be. What does this feature add?
(2)     With the GermanAnalyzer, wildcard searches containing extended
German characters do not seem to work. So, a* is fine but anä* or ö*
always finds zero results. 
(3)     In a similar vein to (2), wildcard searches with escaped special
characters fail to find results. So a search for co\-operative works but
a search for co\-op* fails.
 
I will be grateful for any light that can be shed on these problems.
 
With Thanks,
 
Jon.
 
Jon Humble
BSc (hons,)
Software Engineer
eMail: jon.humble@tecsphere.com

TecSphere Ltd
Centre for Advanced Industry
Coble Dene, Royal Quays
Newcastle upon Tyne NE29 6DE
United Kingdom
 
Direct Dial: +44 (191) 270 31 06
Fax: +44 (191) 270 31 09
http://www.tecsphere.com
 
 





