lucene-java-user mailing list archives

From "Uwe Schindler" <>
Subject RE: Stemming and Wildcard - or fire and water
Date Tue, 11 Dec 2012 10:04:07 GMT
This is a well-known problem: the query parser cannot analyze wildcard terms, because the
analysis would destroy the wildcard characters; also, stemming of parts of terms will never
work. For Solr there is a workaround (MultiTermAware components), but it is very limited
and only works when all analysis components are MultiTermAware, which stemmers are not.
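
To make the mismatch from the report quoted below concrete, here is a minimal sketch in
plain Java. It is a toy model and not Lucene code: the class WildcardStemDemo, the method
toyStem (a crude stand-in for a real stemmer that only strips "en"/"er") and the two term
sets are all hypothetical names invented for illustration, and lowercasing the prefix is
an assumption of the sketch. It also shows one workaround that is often suggested, namely
keeping an additional unstemmed copy of the terms and sending prefix queries there.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of "stemmed index vs. unanalyzed prefix query"; not Lucene API. */
final class WildcardStemDemo {

    // Terms as a stemming analyzer would store them.
    final Set<String> stemmedTerms = new TreeSet<>();
    // Terms that were only lowercased (assumed extra field for wildcard queries).
    final Set<String> exactTerms = new TreeSet<>();

    /** Hypothetical stemmer: lowercases, then strips trailing "en"/"er". */
    static String toyStem(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        while (t.length() > 4 && (t.endsWith("en") || t.endsWith("er"))) {
            t = t.substring(0, t.length() - 2);
        }
        return t;
    }

    /** Index time: store both the stemmed and the merely lowercased form. */
    void index(String word) {
        stemmedTerms.add(toyStem(word));
        exactTerms.add(word.toLowerCase(Locale.ROOT));
    }

    /** Plain term query: the query text is stemmed like the indexed text. */
    boolean termQuery(String word) {
        return stemmedTerms.contains(toyStem(word));
    }

    /** Prefix query: the prefix is lowercased but NOT stemmed. */
    static boolean anyStartsWith(Set<String> terms, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String term : terms) {
            if (term.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    boolean prefixQueryOnStemmedField(String prefix) {
        return anyStartsWith(stemmedTerms, prefix);
    }

    boolean prefixQueryOnExactField(String prefix) {
        return anyStartsWith(exactTerms, prefix);
    }

    public static void main(String[] args) {
        WildcardStemDemo idx = new WildcardStemDemo();
        idx.index("Hersener"); // stored as "hers" (stemmed) and "hersener" (exact)

        System.out.println(idx.termQuery("Hersen"));                 // true
        System.out.println(idx.prefixQueryOnStemmedField("Hers"));   // true
        System.out.println(idx.prefixQueryOnStemmedField("Hersen")); // false
        System.out.println(idx.prefixQueryOnExactField("Hersen"));   // true
    }
}
```

In the toy model "Hersen*" fails on the stemmed terms because "hers" does not start with
"hersen", while the same prefix matches the unstemmed term "hersener". Stemming the prefix
itself would not be a general fix, since a stemmer is not designed to handle fragments of
words.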


Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen

> -----Original Message-----
> From: Bayer Dennis []
> Sent: Tuesday, December 11, 2012 10:50 AM
> To:
> Subject: Stemming and Wildcard - or fire and water
> Hello there,
> my colleague and I ran into an example which didn't return the number of
> results we were expecting. We discovered that there is a mismatch in how
> terms are handled while indexing and searching. This issue has already been
> discussed several times on the internet, as we found out later on, but in our
> view it is buggy behavior, at least when using a German stemmer.
> Tl;dr: a JUnit test case is available (
> Setup:
> * Lucene 4.0.0
> * Use the GermanAnalyzer which internally uses a GermanStemmer
> Issue:
> * Create an index for "Hersener", which has a common German ending ->
> the string is shortened to "hers"
> * Search for "Hers" -> a result is found
> * Search for "Hersen" -> a result is found because the input token is also
> stemmed to "hers"
> * Search for "Hers*" -> a result is found
> * Search for "Hersen*" -> nothing is found because the analyzer does not
> run
> Similar examples can be constructed easily if umlauts are involved.
> Conclusion:
> A search query which contains a wildcard should also be run through the
> analyzer, because otherwise there are a lot of queries which return nothing.
> The Lucene FAQ already has a topic related to this issue:
> java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
> The example with "dog" and "dogs" works as long as only one character is
> stemmed, which may be true for the majority of English words. But if more
> characters are involved, Lucene returns nothing instead of returning
> a few additional items. Just consider "families", which is stemmed to "famili".
> Searching for "familie*" wouldn't return any item.
> To find an ending for this initial post ;) :
> Could this behavior be made configurable in the standard? If not:
> a) Why are the stemmers used by default if they can lead to wrong results?
> b) What can be done manually to stem queries containing wildcards, e.g.
> by overriding some parser?
> Best regards
> Dennis
