lucene-dev mailing list archives

From "Christoph Kiehl"
Subject Re: [PATCH] Refactoring QueryParser.jj, setLowercaseWildcardTerms()
Date Thu, 13 Feb 2003 12:49:46 GMT
Tatu Saloranta wrote:

> - Stemming can only be done for prefix queries (what is stem of,
>   say, "hä*er"?), and even then it might not produce stem one would
>   want. For example, for prefix query "men*" might be 'stemmed' to
>   "man*", and user might be perplexed at why documents with
>   words like "meningitis" and "menstrual" did not match (ok, that is
>   a contrived example, but hope you get the idea).

Good point. It is really amazing how different and complex languages are.

>  In a way, you could think that user is doing "manual stemming", using
>  a stem of a word with prefix query.

Yup, but for example the German word "Möllemann" is a surname, so there is
nothing to stem. If you search for "möllema*" now, you won't get any results
because "Möllemann" is indexed as "mollemann". OK, I admit it would be
uncommon to search for "möllema*" if you want to find occurrences of
"Möllemann". But this is only an example.

> In case of german, if umlaut chars are typically converted, perhaps
> you could create a that just extends default
> query parser, and does necessary transformation for wildcard/prefix
> queries? Since there already exists separate language-dependant
> stemmers,  this might make sense?

Yep, this would be worth a try. But I'm not sure whether it really solves all
the problems. I'm still trying to get the whole picture of the problem ;)
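
To make the suggestion above concrete, here is a minimal sketch of the idea,
assuming the only transformation applied at index time is lowercasing plus
ä/ö/ü -> a/o/u. It deliberately does not use the Lucene API: the class
UmlautFolding and its methods fold() and prefixSearch() are hypothetical names
made up for illustration, and the "index" is just a list of already analyzed
terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: applies to a prefix query the same
// umlaut folding that is assumed to have been applied to the indexed terms.
class UmlautFolding {

    // Lowercase the term and replace umlauts with their base vowels.
    // Unicode escapes are used so the source compiles under any file encoding.
    static String fold(String term) {
        StringBuilder sb = new StringBuilder(term.length());
        for (char c : term.toLowerCase(Locale.ROOT).toCharArray()) {
            switch (c) {
                case '\u00e4': sb.append('a'); break; // a-umlaut
                case '\u00f6': sb.append('o'); break; // o-umlaut
                case '\u00fc': sb.append('u'); break; // u-umlaut
                default:       sb.append(c);
            }
        }
        return sb.toString();
    }

    // Return the indexed terms that start with the prefix of a query such as
    // "m\u00f6llema*". The query prefix is folded only when foldQuery is true.
    static List<String> prefixSearch(List<String> indexedTerms, String query, boolean foldQuery) {
        String prefix = query.endsWith("*") ? query.substring(0, query.length() - 1) : query;
        if (foldQuery) {
            prefix = fold(prefix);
        }
        List<String> hits = new ArrayList<>();
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Terms as they are described as being indexed in this thread.
        List<String> index = Arrays.asList("mollemann", "hau");
        System.out.println(prefixSearch(index, "m\u00f6llema*", false)); // unfolded query
        System.out.println(prefixSearch(index, "m\u00f6llema*", true));  // folded query
        System.out.println(prefixSearch(index, "h\u00e4us*", true));     // folded, but index term is stemmed
    }
}
```

With folding, "möllema*" becomes "mollema*" and finds "mollemann". The "häus*"
case still finds nothing, because the folded prefix "haus" does not match the
stemmed term "hau", so folding alone would not cover the stemming problem.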

>> is also the problem with Umlauts (ä,ö,ü) turned into vowels (a,o,u)
>> while indexing. An example: "Häuser" is the plural of "Haus". If I
>> index "Häuser" it is stemmed to "hau". If I do for example a search
>> for "häus*" nothing is
> Not "haus"?

I meant "häus*", but I admit it would be more natural to search for "haus" ;).
Perhaps I'm trying to find problems where there are none ;). But it really
depends on how you use Lucene.

>> and defaults to off. Hope you understand my problem ;)
> Yes I do... I don't even dare to think of problems finnish analyzer
> might have, with stemming. :-)


A bit confused
