Date: Thu, 13 Feb 2003 07:57:42 +0100
From: maurits van wijland
Subject: Re: [PATCH] Refactoring QueryParser.jj, setLowercaseWildcardTerms()
To: Lucene Developers List
Cc: Christoph Kiehl, tatu@hypermall.net
Message-id: <049d01c2d32d$366716f0$0200a8c0@whale>
References: <3E4A8425.6050602@lucene.com> <200302121843.43835.tatu@hypermall.net>

Hi all,

Maybe we should start using stemming in a different manner. Look at it
from the perspective of query expansion. If we store stems in a separate
table, we will not have this problem!

So, each token is stored in the index as a term.
Each term is stemmed with the appropriate stemmer.
Store each stem and unstemmed term in a separate index.

We could then search using the terms entered, and first find all the
terms that match the WildcardQuery. Next, you could take the terms found
and stem them. From there, you retrieve all the terms related to
that stem! Finally, search for documents with all terms retrieved.

This would give end users an extra option: turning query
expansion on or off.

Your thoughts, please.

kind regards,

Maurits.

----- Original Message -----
From: "Tatu Saloranta"
To: "Lucene Developers List"
Sent: Thursday, February 13, 2003 2:43 AM
Subject: Re: [PATCH] Refactoring QueryParser.jj, setLowercaseWildcardTerms()

> On Wednesday 12 February 2003 11:39, Christoph Kiehl wrote:
> > Hi Doug,
> >
> > > Also, I think we should lowercase prefix and wildcard queries by
> ...
> > > wildcard searches. What do others think?
> >
> > For the StandardAnalyzer this might work, but for the GermanAnalyzer, there
>
> Solving this problem should be easier after refactoring, just
> override 'getPrefixQuery()' and 'getWildcardQuery()' (see below for one possible
> idea of what could be done).
>
> Another possibility would be to have another property for enabling use of the same
> analyzer used for normal terms for wildcard/prefix queries.
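
A minimal sketch of the override idea Tatu describes above, in plain Java. Only the hook names getPrefixQuery() and getWildcardQuery() come from this thread; the SketchParser base class, its (field, term) signatures, and the strings standing in for query objects are assumptions made so the example compiles without Lucene. The subclass lowercases the term text before handing it to the default behaviour, which is presumably what the setLowercaseWildcardTerms() option in the subject line is meant to toggle:

```java
import java.util.Locale;

class WildcardHookSketch {

    // Hypothetical stand-in for the refactored query parser. Strings stand in
    // for query objects so the sketch needs nothing beyond the JDK.
    static class SketchParser {
        protected String getWildcardQuery(String field, String termStr) {
            return "wildcard(" + field + ":" + termStr + ")";
        }

        protected String getPrefixQuery(String field, String termStr) {
            return "prefix(" + field + ":" + termStr + "*)";
        }
    }

    // Subclass that normalises case in the two hooks and otherwise reuses
    // the default behaviour. The wildcard characters '*' and '?' have no
    // case, so lowercasing leaves them untouched.
    static class LowercasingParser extends SketchParser {
        @Override
        protected String getWildcardQuery(String field, String termStr) {
            return super.getWildcardQuery(field, termStr.toLowerCase(Locale.ROOT));
        }

        @Override
        protected String getPrefixQuery(String field, String termStr) {
            return super.getPrefixQuery(field, termStr.toLowerCase(Locale.ROOT));
        }
    }

    public static void main(String[] args) {
        SketchParser parser = new LowercasingParser();
        System.out.println(parser.getWildcardQuery("title", "Ho?Se*"));
        System.out.println(parser.getPrefixQuery("title", "Hou"));
    }
}
```

A language-specific subclass could do more in the same two places; this sketch deliberately stops at lowercasing.
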
>
> However, using typical analyzers is not something one usually wants to do
> for a couple of reasons:
>
> - Wildcards are discarded by the analyzer, so the wildcard query will get broken (i.e.
> one needs a wildcard-char-aware analyzer)
> - Stemming can only be done for prefix queries (what is the stem of,
> say, "hä*er"?), and even then it might not produce the stem one would
> want. For example, the prefix query "men*" might be 'stemmed' to
> "man*", and the user might be perplexed at why documents with
> words like "meningitis" and "menstrual" did not match (ok, that is
> a contrived example, but hope you get the idea).
> In a way, you could think that the user is doing "manual stemming", using
> a stem of a word with a prefix query.
>
> In the case of German, if umlaut chars are typically converted, perhaps you could
> create a GermanQueryParser.java that just extends the default query parser, and
> does the necessary transformation for wildcard/prefix queries? Since there
> already exist separate language-dependent stemmers, this might make sense?
>
> > is also the problem with Umlauts (ä,ö,ü) turned into vowels (a,o,u) while
> > indexing. An example: "Häuser" is the plural of "Haus". If I index "Häuser"
> > it is stemmed to "hau". If I do for example a search for "häus*" nothing is
>
> Not "haus"?
>
> > found, because "häus" is not stemmed. If I would analyze "häus*" I should
> > get "hau*". The problem is that now you do not only get "Häuser" but also
> > "Haus" as a result. But I think it is better to get more results than no
> > result. This is perhaps a special problem with the GermanAnalyzer. Maybe
> > there could be an option to use the Analyzer also for wildcard queries. So
> > I can turn it on in my case and it defaults to off.
> > Hope you understand my problem ;)
>
> Yes I do... I don't even dare to think of the problems a Finnish analyzer might
> have, with stemming.
> :-)
>
> -+ Tatu +-

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
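
The expansion steps proposed at the top of this message can be sketched as follows, in plain Java without Lucene. Everything named here (StemExpansionSketch, addTerm, expand, toyStem) is hypothetical: the sorted term set and the stem-to-terms map stand in for the two indexes, and toyStem is a lookup placeholder for a real stemmer, not an actual stemming algorithm.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.function.Function;
import java.util.regex.Pattern;

class StemExpansionSketch {
    private final Function<String, String> stemmer;
    // "Index" of unstemmed terms, as they occur in the documents.
    private final SortedSet<String> terms = new TreeSet<>();
    // Separate "table": stem -> every unstemmed term that reduces to it.
    private final Map<String, SortedSet<String>> termsByStem = new HashMap<>();

    StemExpansionSketch(Function<String, String> stemmer) {
        this.stemmer = stemmer;
    }

    // Indexing: store the term itself and register it under its stem.
    void addTerm(String term) {
        terms.add(term);
        termsByStem.computeIfAbsent(stemmer.apply(term), k -> new TreeSet<>()).add(term);
    }

    // Translate '*' and '?' wildcards into a regex, quoting everything else.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') sb.append(".*");
            else if (c == '?') sb.append('.');
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString());
    }

    // Query time: match the wildcard against unstemmed terms first; if
    // expansion is on, stem each match and add every term sharing that stem.
    // The result is the set of terms to search documents for.
    SortedSet<String> expand(String wildcard, boolean expansionOn) {
        Pattern pattern = toRegex(wildcard);
        SortedSet<String> result = new TreeSet<>();
        for (String term : terms) {
            if (!pattern.matcher(term).matches()) continue;
            result.add(term);
            if (expansionOn) {
                // Never null here: every indexed term was registered by addTerm.
                result.addAll(termsByStem.get(stemmer.apply(term)));
            }
        }
        return result;
    }

    // Placeholder stemmer (assumption): knows a single irregular plural.
    static String toyStem(String term) {
        return term.equals("men") ? "man" : term;
    }

    public static void main(String[] args) {
        StemExpansionSketch index = new StemExpansionSketch(StemExpansionSketch::toyStem);
        for (String t : new String[] {"man", "men", "meningitis", "menstrual", "house"}) {
            index.addTerm(t);
        }
        System.out.println(index.expand("men*", true));
        System.out.println(index.expand("men*", false));
    }
}
```

With man, men, meningitis and menstrual indexed, expand("men*", true) returns all four, because the wildcard is matched against the unstemmed terms before any stemming happens; with expansion off, man drops out.
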