From: Tatu Saloranta <tatu@hypermall.net>
Reply-To: tatu@hypermall.net
Organization: Linux-users missalie
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Subject: Re: Wildcard search and analyzers
Date: Tue, 11 Feb 2003 20:53:33 -0700
References: <b2ble7$cp3$1@main.gmane.org>
In-Reply-To: <b2ble7$cp3$1@main.gmane.org>
Message-Id: <200302112053.33392.tatu@hypermall.net>

On Tuesday 11 February 2003 13:13, Christoph Kiehl wrote:
> Hi,
>
> I was just faced with the problem of wildcard queries not being analyzed.
> While it might be a good idea to have no real stemming for wildcard terms,
> I think one should do some basic filtering like convert to lowercase and
> for german replacing ä, ö, ü with vowels.

Just curious... what would you replace those with? (I'm not saying there's no
need for that; there are many cases where one might want to do it. I'm just
interested in this case.)

> For the english stemmer it might be a fix to simply lowercase these terms,
> but with the german one it gets more complicated. I mean I could replace
> these vowels manually, but that's a bad idea. IMHO we should separate
> between real stemming and filtering, where filtering would also apply to
> wildcard terms.
> What do you think about that? Any better ideas?

I don't think there's a one-size-fits-all solution, as using the same analyzer
for both normal and wildcard/prefix terms is unlikely to work well in general.
One reason is that prefixes are commonly short terms, and those are likely to
be removed by a stop word list.
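
To illustrate the stop word problem, here is a minimal sketch. It does not use
Lucene at all: StopWordSketch, its analyze() method and the stop list are
hypothetical stand-ins for an analyzer that lowercases tokens and drops stop
words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an analyzer: lowercases tokens and drops stop words.
class StopWordSketch {
    // Assumed stop list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "in", "on", "the", "to"));

    static List<String> analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !STOP_WORDS.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The user typed "in*", hoping to match "index", "input", and so on.
        System.out.println(analyze("in"));     // the prefix text is dropped entirely
        System.out.println(analyze("Index"));  // an ordinary term survives, lowercased
    }
}
```

If the text of the prefix "in*" went through such an analyzer, nothing would be
left to build the prefix query from.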

One easy solution would be to add a couple of flags to QueryParser for the most
common cases (lowercasing at least).
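
A flag of that kind might look roughly like the sketch below. The class
WildcardTermOptions and the flag name lowercaseWildcards are assumptions made
up for this example; they are not existing QueryParser members.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene's QueryParser.
class WildcardTermOptions {
    // Assumed flag: off by default, so existing behaviour stays unchanged.
    private boolean lowercaseWildcards = false;

    void setLowercaseWildcards(boolean value) {
        this.lowercaseWildcards = value;
    }

    // Applied to the raw text of a wildcard or prefix term, wildcard characters included.
    String normalize(String rawTerm) {
        return lowercaseWildcards ? rawTerm.toLowerCase(Locale.ROOT) : rawTerm;
    }

    public static void main(String[] args) {
        WildcardTermOptions options = new WildcardTermOptions();
        System.out.println(options.normalize("Luc*"));  // flag off: term is left alone
        options.setLowercaseWildcards(true);
        System.out.println(options.normalize("Luc*"));  // flag on: term is lowercased
    }
}
```

Only the wildcard/prefix term text is touched; no stemming or stop word
filtering is involved.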

Since I was thinking of doing some small refactoring of the QueryParser class
(just adding factory methods to replace the new XxxQuery() constructor calls,
so that those can be overridden), I could also refactor the wildcard/prefix
query term handling so that it is easy to override (that is, one need not
modify the Lucene source code; it would be enough to extend the parser class
and redefine the functionality)?
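
The factory method idea could be sketched as follows. This is a toy, not the
real QueryParser: SketchQueryParser, LowercasingPrefixParser and the method
names newTermQuery/newPrefixQuery are hypothetical, and queries are
represented as plain strings to keep the example self-contained.

```java
import java.util.Locale;

// Hypothetical, heavily simplified parser; the real QueryParser is far more involved.
class SketchQueryParser {
    // Parses a single term: a trailing '*' yields a prefix query, anything else a term query.
    String parseTerm(String field, String text) {
        if (text.length() > 1 && text.endsWith("*")) {
            return newPrefixQuery(field, text.substring(0, text.length() - 1));
        }
        return newTermQuery(field, text);
    }

    // Factory methods: subclasses override these instead of editing the parser itself.
    protected String newTermQuery(String field, String term) {
        return "TermQuery(" + field + ":" + term + ")";
    }

    protected String newPrefixQuery(String field, String prefix) {
        return "PrefixQuery(" + field + ":" + prefix + "*)";
    }
}

// Example extension: lowercase prefix terms only, leaving the rest of the parser untouched.
class LowercasingPrefixParser extends SketchQueryParser {
    @Override
    protected String newPrefixQuery(String field, String prefix) {
        return super.newPrefixQuery(field, prefix.toLowerCase(Locale.ROOT));
    }
}

class FactoryMethodDemo {
    public static void main(String[] args) {
        System.out.println(new SketchQueryParser().parseTerm("title", "Luc*"));
        System.out.println(new LowercasingPrefixParser().parseTerm("title", "Luc*"));
        System.out.println(new LowercasingPrefixParser().parseTerm("title", "Lucene"));
    }
}
```

The subclass changes how prefix terms are handled without any change to the
parsing code in the base class, which is the point of the refactoring.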

-+ Tatu +-

