lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 12569] New: - Uppercase/lowercase distinction in GermanStemmer not sustainable
Date Thu, 12 Sep 2002 10:53:30 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=12569>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=12569

           Summary: Uppercase/lowercase distinction in GermanStemmer not
                    sustainable
           Product: Lucene
           Version: CVS Nightly - Specify date in submission
          Platform: All
        OS/Version: Other
            Status: NEW
          Severity: Normal
          Priority: Other
         Component: QueryParser
        AssignedTo: lucene-dev@jakarta.apache.org
        ReportedBy: cmad@lanlab.de


I ran into a problem with the German stemmer: it tries to detect nouns by
looking for an uppercase first letter.
This information is only used when a word ends with "t", in which case the
trailing "t" is not stripped.

However, it's very naive to think words are nouns if and only if they begin
with a capital letter. They may also be at the beginning of a sentence or
within a quote, in which cases they may be capitalized as well.

Much worse, however, is the fact that most people write their queries in
lower case. That means words can be stemmed differently in the query than in
the index, so the results differ depending on whether someone entered the
query in upper- or lowercase.
Example: The word "Fakultäten" is stemmed to "fakultat", while "fakultäten"
becomes "fakulta".
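
To make the mismatch concrete, here is a minimal sketch. It is not the real
GermanStemmer: ToyGermanStemmer and its detectNouns flag are hypothetical, and
the class keeps only three simplified rules (folding of "ä", removal of a
trailing "en", removal of a trailing "t") so that the effect of the uppercase
check can be seen in isolation.

```java
import java.util.Locale;

/**
 * Toy illustration only, NOT the real GermanStemmer. It implements just
 * enough simplified rules to show how the uppercase check makes the stem
 * depend on the case of the input.
 */
public class ToyGermanStemmer {

    /**
     * @param term        the word to stem
     * @param detectNouns hypothetical switch: if true, a leading capital
     *                    marks the term as a noun and its trailing "t" is
     *                    kept (the behaviour described in this report)
     */
    public static String stem(String term, boolean detectNouns) {
        if (term.isEmpty()) {
            return term;
        }
        // Remember the case of the first letter before lowercasing.
        boolean uppercase = detectNouns && Character.isUpperCase(term.charAt(0));
        // Lowercase and fold "ä" (U+00E4) to "a"; the real stemmer has more substitutions.
        StringBuilder buffer = new StringBuilder(
                term.toLowerCase(Locale.GERMAN).replace('\u00e4', 'a'));
        // Simplified suffix rule: drop a trailing "en".
        if (buffer.length() > 2 && buffer.toString().endsWith("en")) {
            buffer.setLength(buffer.length() - 2);
        }
        // The rule in question: drop a trailing "t" unless the term was marked as a noun.
        if (buffer.length() > 0 && buffer.charAt(buffer.length() - 1) == 't' && !uppercase) {
            buffer.deleteCharAt(buffer.length() - 1);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // With the uppercase check, index term and query term get different stems.
        System.out.println(stem("Fakult\u00e4ten", true));  // fakultat
        System.out.println(stem("fakult\u00e4ten", true));  // fakulta
        // Without the check, both spellings get the same stem.
        System.out.println(stem("Fakult\u00e4ten", false)); // fakulta
        System.out.println(stem("fakult\u00e4ten", false)); // fakulta
    }
}
```

With the check enabled, a document containing the capitalized word and a
lowercase query for the same word end up with different stems and cannot
match; with it disabled, both agree.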

I commented out the relevant lines in GermanStemmer (see below; the diff is
against CVS version 2002-09-05).
However, I'm not enough of a linguist to tell whether stripping a trailing "t"
from a noun is too aggressive.

Regards

--Clemens


--- GermanStemmer.java~1~       2002-08-19 07:13:42.000000000 +0000
+++ GermanStemmer.java  2002-09-11 15:59:53.000000000 +0000
@@ -72,7 +72,7 @@
     /**
      * Indicates if a term is handled as a noun.
      */
-    private boolean uppercase = false;
+//    private boolean uppercase = false;

     /**
     * Amount of characters that are removed with <tt>substitute()</tt> while stemming.
@@ -88,7 +88,9 @@
     protected String stem( String term )
     {
        // Mark a possible noun.
-       uppercase = Character.isUpperCase( term.charAt( 0 ) );
+       /* uppercase = Character.isUpperCase( term.charAt( 0 ) );
+       Can't use this - People don't use uppercase words in queries.  --Clemens
+    */
        // Use lowercase for medium stemming.
        term = term.toLowerCase();
        if ( !isStemmable( term ) )
@@ -153,7 +155,7 @@
                buffer.deleteCharAt( buffer.length() - 1 );
            }
            // "t" occurs only as suffix of verbs.
-           else if ( buffer.charAt( buffer.length() - 1 ) == 't' && ! uppercase ) {
+           else if ( buffer.charAt( buffer.length() - 1 ) == 't' /*&& ! uppercase*/ ) {
                buffer.deleteCharAt( buffer.length() - 1 );
            }
            else {

--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>