lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tatu Saloranta <t...@hypermall.net>
Subject Re: Lowercasing wildcards - why?
Date Sat, 31 May 2003 01:19:11 GMT
On Friday 30 May 2003 09:55, Leo Galambos wrote:
> Ah, I got it. THX. In the good old days, the wildcards were used as a
> fix for missing stemming module. I am not sure if you can combine these
> two opposite approaches successfully. I see the following drawbacks of
> your solution.
>
> Example:
> built* (->built) could be changed to build* (no built, but ->builder,
> building, etc.), and precision will go down drastically.
>
> You probably use a stemmer with one important bug (a.k.a. feature) -
> overstemming, so here is another example:
> political* (->political, politically) is transformed to polic*
> (->policer, policy, policies, policement etc.) by Porter alg., and the
> precision is again affected drastically

Yes, this is the exact problem that was brought up last time this was 
discussed. It may not be a very common problem (most of the time stemming a 
wildcard part probably works ok, somebody had tried that), but still a 
potential one. And that's why default lower casing was added, as it solved 
one of FAQs. It is much more common that analyzer used for non-wildcard query 
does lower casing, than not, and thus default setting (which leads to having
to turn feature off by some users) seems to make sense.

More general problem then is that there's no real way to stem "foo?ar", or any 
non-prefix wildcard query, but that could be figured out by QueryParser if 
necessary.

-+ Tatu +-


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message