lucene-dev mailing list archives

From "Clemens Marschner" <c...@lanlab.de>
Subject Uppercase/lowercase in GermanStemmer
Date Wed, 11 Sep 2002 16:11:28 GMT
I had a problem with the German stemmer, since it tries to detect nouns by
looking for an uppercase first letter.
This information is only used when a word ends with "t": if the word is
capitalized, the trailing "t" is not stripped.

However, it's very naive to think words are nouns if and only if they begin
with a capital letter. They may also be at the beginning of a sentence or
within a quote, in which case they may be capitalized as well.

Much worse, however, is the fact that most people write their queries in
lower case. That means words can be stemmed differently in the query than in
the index, so the results differ depending on whether someone entered the
query in upper- or lowercase.
Example: The word "Fakultäten" is stemmed to "fakultat", while "fakultäten"
becomes "fakulta".
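The mismatch can be reproduced with a toy sketch. This is not the real
GermanStemmer: the class TrailingTDemo, its stem() method, and the
skipCapitalized flag are made up for illustration, and the sketch only
lowercases the term and strips one trailing "t". It does no umlaut
substitution or suffix removal, so the input used here is "Fakultat" rather
than "Fakultäten".

```java
// Toy illustration only: NOT the real GermanStemmer algorithm.
// Class name, method name, and the skipCapitalized flag are hypothetical.
final class TrailingTDemo {

    /**
     * Lowercases the term and strips one trailing 't'.
     * If skipCapitalized is true, a term whose first letter is uppercase
     * is treated as a noun and keeps its trailing 't'.
     */
    static String stem(String term, boolean skipCapitalized) {
        if (term.isEmpty()) {
            return term;
        }
        // Same noun heuristic as in the mail: uppercase first letter.
        boolean uppercase = Character.isUpperCase(term.charAt(0));
        StringBuilder buffer = new StringBuilder(term.toLowerCase());
        boolean endsWithT = buffer.charAt(buffer.length() - 1) == 't';
        if (endsWithT && !(skipCapitalized && uppercase)) {
            buffer.deleteCharAt(buffer.length() - 1);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // With the noun heuristic, indexed text and a lowercase query disagree.
        System.out.println(stem("Fakultat", true));   // fakultat
        System.out.println(stem("fakultat", true));   // fakulta
        // Without the heuristic, both spellings yield the same stem.
        System.out.println(stem("Fakultat", false));  // fakulta
        System.out.println(stem("fakultat", false));  // fakulta
    }
}
```

With the heuristic switched on, a term indexed as "Fakultat" and a query
typed as "fakultat" end up with different stems and so would not match.
Switching it off makes the stem independent of case.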

I commented the lines out in GermanStemmer (see below, the diff is from CVS
version 2002-09-05).
However, I'm not enough of a linguist to tell whether stripping a trailing
"t" from a noun is too aggressive.

Regards

--Clemens


--- GermanStemmer.java~1~       2002-08-19 07:13:42.000000000 +0000
+++ GermanStemmer.java  2002-09-11 15:59:53.000000000 +0000
@@ -72,7 +72,7 @@
     /**
      * Indicates if a term is handled as a noun.
      */
-    private boolean uppercase = false;
+//    private boolean uppercase = false;

     /**
      * Amount of characters that are removed with <tt>substitute()</tt> while stemming.
@@ -88,7 +88,9 @@
     protected String stem( String term )
     {
        // Mark a possible noun.
-       uppercase = Character.isUpperCase( term.charAt( 0 ) );
+       /* uppercase = Character.isUpperCase( term.charAt( 0 ) );
+       Can't use this - People don't use uppercase words in queries.  --Clemens
+    */
        // Use lowercase for medium stemming.
        term = term.toLowerCase();
        if ( !isStemmable( term ) )
@@ -153,7 +155,7 @@
                buffer.deleteCharAt( buffer.length() - 1 );
            }
            // "t" occurs only as suffix of verbs.
-           else if ( buffer.charAt( buffer.length() - 1 ) == 't' && !uppercase ) {
+           else if ( buffer.charAt( buffer.length() - 1 ) == 't' /*&& !uppercase*/ ) {
                buffer.deleteCharAt( buffer.length() - 1 );
            }
            else {




--------------------------------------
http://www.cmarschner.net



