lucene-java-user mailing list archives

From Gerhard Schwarz
Subject Re: Problem with tokenizing/stemming in GermanAnalyzer
Date Mon, 17 Feb 2003 16:44:18 GMT
Christoph Kiehl wrote:
> Hi Volker,
>>I have noticed a strange problem with capitalization. Search for
>>"computer" results in the token "compu". Search for "Computer",
>>however, results in "comput". The search is supposed to be
>>case-insensitive, so this must be a bug, right?
> This problem was already mentioned on the developer list. The analyzer tries
> to do some noun recognition. But it does a bad job ;)

The analyzer should not do any case recognition. After reading through
the mailing list from the last few weeks and months (I have been busy
lately), I found that a very simple algorithm that only discriminates
terms uniquely is what most users need. The original algorithm offers
more ways to extend it.

> For now you could check out the current lucene version from cvs and just
> comment out the following line:
>  uppercase = Character.isUpperCase( term.charAt( 0 ) );
> Then just run ant to build the jar. This fixes the problem you described.
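
To illustrate why that one flag matters, here is a minimal sketch. It is
not the GermanStemmer code: the class StemSketch, its methods, and the two
suffix rules are hypothetical and were picked only to reproduce the
reported symptom ("computer" gives "compu", "Computer" gives "comput").
The only line taken from the discussion is the isUpperCase check.

```java
import java.util.Locale;

// Toy illustration only; the suffix rules below are assumptions,
// not the rules used by Lucene's GermanStemmer.
class StemSketch {

    // Lowercases the term, strips a trailing "er", and strips a trailing
    // "t" only when the term is NOT treated as a noun (hypothetical rule).
    static String strip(String term, boolean treatAsNoun) {
        String s = term.toLowerCase(Locale.GERMAN);
        if (s.endsWith("er")) {
            s = s.substring(0, s.length() - 2);
        }
        if (!treatAsNoun && s.endsWith("t")) {
            s = s.substring(0, s.length() - 1);
        }
        return s;
    }

    // Case-sensitive variant: guesses "noun" from a capital first letter,
    // so the same word can end up with two different stems.
    static String stemWithCaseRecognition(String term) {
        boolean uppercase = Character.isUpperCase(term.charAt(0));
        return strip(term, uppercase);
    }

    // Case-insensitive variant: the flag is never set, which is the
    // effect of commenting out the isUpperCase line.
    static String stemIgnoringCase(String term) {
        return strip(term, false);
    }

    public static void main(String[] args) {
        System.out.println(stemWithCaseRecognition("computer")); // compu
        System.out.println(stemWithCaseRecognition("Computer")); // comput
        System.out.println(stemIgnoringCase("computer"));        // compu
        System.out.println(stemIgnoringCase("Computer"));        // compu
    }
}
```

With the flag removed, indexing and searching see the same token
regardless of capitalization, which is what a case-insensitive search
needs.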

I promise I will check the stemmer in the next few days... hm... not
before this weekend, I have a martial arts challenge on Sunday. Mentally,
I'm not prepared to _fix_ anything. :)

There is another problem with the Umlaut-conversion that also should be 

