lucene-java-user mailing list archives

From Gerhard Schwarz <gerhard.schw...@fpg.de>
Subject Re: Strange Results with German Analyzer
Date Thu, 20 Dec 2001 14:37:36 GMT
Hello,

Jan Stvvesand wrote:
> 
> Hi,
> 
> I used a German Analyzer for Indexing and Searching. afaik, the search is
> case insensitive. At least I get the same searchresults for
> 
> kapitalanlagen
> Kapitalanlagen
> 
> But, for some words the Analyzer behaves somewhat funny:
> 
> Holland -> 22 results
> hollAnd -> 22 results
> hollanD -> 22 results
> HOLLAND -> 22 results
> 
> holland -> 1 result (!) which is NOT in the 22 results mentioned above.

That result is correct.
 
> I have no idea and my knowledge about Searching, stemming, indexing etc is,
> well, small.

Well, I'll try to explain it briefly.
Words starting with an uppercase letter are stemmed differently
from all other words. Words containing an uppercase letter anywhere
other than the first position, and words containing more than one
uppercase letter, are not stemmed at all.

So the stemming looks like this:

Holland -> possibly a noun -> stemmed to "holland"
hollAnd -> not a regular German word -> ignored by the stemmer ->
lowercase filter -> "holland"
hollanD -> not a regular German word -> ignored by the stemmer ->
lowercase filter -> "holland"
HOLLAND -> not a regular German word -> ignored by the stemmer ->
lowercase filter -> "holland"
holland -> stemmed to "holla" ("nd" is a suffix that is stripped from
non-nouns)
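
To make the case rule above concrete, here is a minimal sketch in plain
Java. It is not the GermanAnalyzer source: the class name CaseRuleSketch,
the Kind enum, and the methods classify and toTerm are hypothetical, and
the only "stemming" it does is the single "nd" rule mentioned above. It
assumes a possible noun has exactly one uppercase letter, in first
position.

```java
import java.util.Locale;

// Hypothetical illustration of the case rule described in this mail.
// It does not use or reproduce any Lucene class.
public class CaseRuleSketch {
    enum Kind { POSSIBLE_NOUN, IRREGULAR, OTHER }

    // Classify a token purely by where its uppercase letters are.
    static Kind classify(String token) {
        int upper = 0;
        for (int i = 0; i < token.length(); i++) {
            if (Character.isUpperCase(token.charAt(i))) upper++;
        }
        if (upper == 0) return Kind.OTHER;               // e.g. "holland"
        if (upper == 1 && Character.isUpperCase(token.charAt(0))) {
            return Kind.POSSIBLE_NOUN;                   // e.g. "Holland"
        }
        return Kind.IRREGULAR;                           // e.g. "hollAnd", "HOLLAND"
    }

    // Toy term generation: everything is lowercased, and only all-lowercase
    // words lose a trailing "nd". A real stemmer applies many more rules.
    static String toTerm(String token) {
        String lower = token.toLowerCase(Locale.GERMAN);
        if (classify(token) == Kind.OTHER && lower.endsWith("nd")) {
            return lower.substring(0, lower.length() - 2);
        }
        return lower;
    }

    public static void main(String[] args) {
        String[] tokens = {"Holland", "hollAnd", "hollanD", "HOLLAND", "holland"};
        for (String t : tokens) {
            System.out.println(t + " -> " + classify(t) + " -> " + toTerm(t));
        }
    }
}
```

In this sketch the first four spellings all end up as the term "holland",
while the all-lowercase query ends up as "holla". That is why the query
"holland" matches a different set of documents than the other four.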

It looks like the check for irregular words needs some improvement;
it should be less restrictive with words that may simply be mistyped.

Another thing is that the search _is_ case sensitive when the
GermanAnalyzer is used. This is because in German you should search
for a noun as a noun. Stemming nouns differently from the rest gives
much better results than a compromise stemming that ignores case
from the beginning.


HTH,
Gerhard
