From: Tatu Saloranta
Reply-To: tatu@hypermall.net
To: "Lucene Developers List"
Subject: Re: Wildcard search and analyzers
Date: Tue, 11 Feb 2003 20:53:33 -0700

On Tuesday 11 February 2003 13:13, Christoph Kiehl wrote:
> Hi,
>
> I was just faced with the problem of wildcard queries not being analyzed.
> While it might be a good idea to have no real stemming for wildcard terms,
> I think one should do some basic filtering like convert to lowercase and
> for German replacing ä, ö, ü with vowels.

Just curious... what would you replace those with?
(I'm not saying there's no need for that, there are many cases where one might want to do it, just interested in this case)

> For the English stemmer it might be a fix to simply lowercase these terms,
> but with the German one it gets more complicated. I mean I could replace
> these vowels manually, but that's a bad idea. IMHO we should separate
> between real stemming and filtering, where filtering would also apply to
> wildcard terms.
> What do you think about that? Any better ideas?

I don't think there's a one-size-fits-all solution, as using the same analyzer for both normal and wildcard/prefix terms is unlikely to work well generally. One reason is that it's common to use short terms for prefixes, and those would be likely to be removed by a stop word list.

One easy solution would be to add a couple of flags to QueryParser for the most common cases (lowercasing at least). Since I was thinking of doing some small refactoring of the QueryParser class (just adding factory methods to replace the new XxxQuery() constructor calls, so that those can be overridden), I can also refactor wildcard/prefix query term handling so that it's easy to override (that is, one need not modify Lucene source code; it'll be enough to extend the parser class and redefine the functionality)?

-+ Tatu +-
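
To make the proposal above concrete, here is a rough sketch of the shape I have in mind. It is only an illustration: SketchParser, GermanSketchParser, setLowercaseWildcardTerms, normalizeWildcardTerm and newWildcardQuery are hypothetical names, not the existing QueryParser API, the query is represented as a plain String to keep the example self-contained, and the umlaut-to-vowel mapping is just an assumed example of the filtering being discussed.

```java
import java.util.Locale;

/** Hypothetical stand-in for a parser whose wildcard handling can be overridden. */
class SketchParser {
    // Flag for the most common case: lowercasing wildcard/prefix terms.
    private boolean lowercaseWildcardTerms = false;

    void setLowercaseWildcardTerms(boolean enabled) {
        this.lowercaseWildcardTerms = enabled;
    }

    /** Hook: light filtering only, no stemming and no stop word removal. */
    protected String normalizeWildcardTerm(String term) {
        return lowercaseWildcardTerms ? term.toLowerCase(Locale.ROOT) : term;
    }

    /** Factory method replacing a direct constructor call; subclasses may override it. */
    protected String newWildcardQuery(String field, String term) {
        return field + ":" + normalizeWildcardTerm(term);
    }
}

/** Extends the parser instead of modifying its source. */
class GermanSketchParser extends SketchParser {
    @Override
    protected String normalizeWildcardTerm(String term) {
        String t = super.normalizeWildcardTerm(term);
        // Assumed mapping: fold lowercase umlauts to their base vowels.
        return t.replace('\u00e4', 'a').replace('\u00f6', 'o').replace('\u00fc', 'u');
    }
}

class WildcardTermSketch {
    public static void main(String[] args) {
        SketchParser parser = new GermanSketchParser();
        parser.setLowercaseWildcardTerms(true);
        System.out.println(parser.newWildcardQuery("body", "M\u00dcLL*"));
    }
}
```

The point of the design is that the default parser stays simple (a flag or two), while anything language-specific lives in a subclass that overrides one small method.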