From: "Clemens Marschner"
To: "Lucene Developers List"
Subject: Uppercase/lowercase in GermanStemmer
Date: Wed, 11 Sep 2002 18:11:28 +0200
Message-Id: <006a01c259ad$e2c39d00$7e94bb81@majesty>

I had a problem with the German stemmer: it tries to detect nouns by looking for an uppercase first letter. This information is used only when a word ends with "t", in which case the trailing "t" of a presumed noun is not stemmed.

However, it's very naive to think that words are nouns if and only if they begin with a capital letter. Words may also be capitalized because they stand at the beginning of a sentence or within a quote. Much worse, however, is the fact that most people write their queries in lower case.
That means words can be stemmed differently in the query than in the index, so results differ depending on whether the query was entered in upper or lower case. Example: the word "Fakultäten" is stemmed to "fakultat", while "fakultäten" becomes "fakulta".

I commented the lines out in GermanStemmer (see below; the diff is against the CVS version of 2002-09-05). However, I'm not enough of a linguist to tell whether stemming a trailing "t" from a noun goes too far.

Regards

--Clemens

--- GermanStemmer.java~1~	2002-08-19 07:13:42.000000000 +0000
+++ GermanStemmer.java	2002-09-11 15:59:53.000000000 +0000
@@ -72,7 +72,7 @@
     /**
      * Indicates if a term is handled as a noun.
      */
-    private boolean uppercase = false;
+//    private boolean uppercase = false;
 
     /**
      * Amount of characters that are removed with substitute() while stemming.
@@ -88,7 +88,9 @@
     protected String stem( String term )
     {
         // Mark a possible noun.
-        uppercase = Character.isUpperCase( term.charAt( 0 ) );
+        /* uppercase = Character.isUpperCase( term.charAt( 0 ) );
+           Can't use this - People don't use uppercase words in queries. --Clemens
+        */
         // Use lowercase for medium stemming.
         term = term.toLowerCase();
         if ( !isStemmable( term ) )
@@ -153,7 +155,7 @@
                 buffer.deleteCharAt( buffer.length() - 1 );
             }
             // "t" occurs only as suffix of verbs.
-            else if ( buffer.charAt( buffer.length() - 1 ) == 't' && !uppercase ) {
+            else if ( buffer.charAt( buffer.length() - 1 ) == 't' /*&& !uppercase*/ ) {
                 buffer.deleteCharAt( buffer.length() - 1 );
             }
             else {

--------------------------------------
http://www.cmarschner.net
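
To make the index/query mismatch easier to see, here is a minimal, self-contained sketch. It is not the Lucene GermanStemmer: the class ToyStemmer, the method stem(term, useCaseHint) and the reduced rule set (lowercase, replace "ä" with "a", strip a trailing "en", then strip a trailing "t") are all assumptions made for illustration only. The useCaseHint flag toggles the capital-letter noun heuristic discussed above.

```java
import java.util.Locale;

// Hypothetical, heavily simplified stand-in for the stemmer; NOT the Lucene code.
public class ToyStemmer {

    // useCaseHint = true mimics the behavior before the patch (capital first letter => noun).
    static String stem(String term, boolean useCaseHint) {
        // Noun heuristic under discussion: a capitalized first letter marks a noun.
        boolean uppercase = useCaseHint
                && !term.isEmpty()
                && Character.isUpperCase(term.charAt(0));

        // Lowercase and replace the umlaut (toy version of the substitution step).
        StringBuilder buffer = new StringBuilder(
                term.toLowerCase(Locale.ROOT).replace('\u00e4', 'a'));

        // Strip a trailing "en" (toy version of the suffix rules).
        if (buffer.length() > 2 && buffer.toString().endsWith("en")) {
            buffer.setLength(buffer.length() - 2);
        }
        // Strip a trailing "t" unless the term was flagged as a noun.
        if (buffer.length() > 0
                && buffer.charAt(buffer.length() - 1) == 't'
                && !uppercase) {
            buffer.deleteCharAt(buffer.length() - 1);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        String indexed = "Fakult\u00e4ten"; // as written in a document
        String queried = "fakult\u00e4ten"; // as typed in a lowercase query

        // With the heuristic: the two stems differ, so the query misses the document.
        System.out.println(stem(indexed, true) + " vs " + stem(queried, true));
        // Without the heuristic: both forms stem identically.
        System.out.println(stem(indexed, false) + " vs " + stem(queried, false));
    }
}
```

Under these assumptions the first line prints "fakultat vs fakulta" and the second prints "fakulta vs fakulta", which matches the example in the message. Whether stripping the "t" from nouns is linguistically sound remains the open question; the sketch only shows that the case-dependent rule makes stems inconsistent between index and query.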