lucene-java-user mailing list archives

From Bayer Dennis <Dennis.Ba...@cursor.de>
Subject Stemming and Wildcard - or fire and water
Date Tue, 11 Dec 2012 09:49:44 GMT
Hello there,
my colleague and I ran into an example which didn't return the number of results we were expecting.
We discovered that terms are handled differently at index time and at search time. As we found
out later, this issue has already been discussed several times on the internet, but from
our point of view it is buggy behavior, at least when using a German stemmer.

Tl;dr: a JUnit test case is available (http://pastebin.com/AdeFdW1k)

Setup:
* Lucene 4.0.0
* Use the GermanAnalyzer which internally uses a GermanStemmer

Issue:
* Index "Hersener", which has a common German ending -> the string is shortened
to "hers"
* Search for "Hers" -> a result is found
* Search for "Hersen" -> a result is found because the input token is also stemmed to "hers"
* Search for "Hers*" -> a result is found
* Search for "Hersen*" -> nothing is found because the analyzer does not run on wildcard terms, so "hersen" is never stemmed to "hers"
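
The four searches above can be modelled without Lucene. The following is only a minimal sketch in plain Java: the stem() method is a hypothetical stand-in for the GermanStemmer (it just strips trailing "er"/"en"), and it assumes that the parser lowercases a wildcard prefix but does not stem it. The class and method names are made up for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy model of the index/search mismatch. This is not Lucene code. */
final class WildcardStemDemo {

    // Hypothetical stand-in for GermanStemmer: lowercases the token, then
    // strips trailing "er"/"en" while at least three characters remain.
    static String stem(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        while ((t.endsWith("er") || t.endsWith("en")) && t.length() - 2 >= 3) {
            t = t.substring(0, t.length() - 2);
        }
        return t;
    }

    // Index time: every token is stemmed before it is stored.
    static Set<String> index(String... tokens) {
        Set<String> terms = new HashSet<>();
        for (String token : tokens) {
            terms.add(stem(token));
        }
        return terms;
    }

    // Search time: a plain term is stemmed, but a trailing-wildcard query
    // is only lowercased (assumption) and compared as a raw prefix.
    static boolean matches(Set<String> indexedTerms, String query) {
        if (query.endsWith("*")) {
            String prefix = query.substring(0, query.length() - 1)
                                 .toLowerCase(Locale.ROOT);
            for (String term : indexedTerms) {
                if (term.startsWith(prefix)) {
                    return true;
                }
            }
            return false;
        }
        return indexedTerms.contains(stem(query));
    }

    public static void main(String[] args) {
        Set<String> idx = index("Hersener"); // stored as "hers"
        for (String q : new String[] {"Hers", "Hersen", "Hers*", "Hersen*"}) {
            System.out.println(q + " -> " + matches(idx, q));
        }
    }
}
```

In this model only "Hersen*" fails, because the raw prefix "hersen" is longer than the stored term "hers".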

Similar examples can be constructed easily if umlauts are involved.

Conclusion:
A search query which contains a wildcard should also be run through the analyzer, because
otherwise there are a lot of queries which return nothing. The Lucene FAQ already has a topic related
to this issue: http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F

The example with "dog" and "dogs" works only as long as a single character is stemmed, which could
be true for the majority of English words. But if more characters are involved, Lucene returns
nothing at all instead of returning a few additional items. Just consider "families", which is stemmed
to "famili": searching for "familie*" wouldn't return any item.

To find an ending for this initial post ;) :
Could this behavior be made configurable in the standard? If not:
a) Why are the stemmers used by default if they can lead to wrong results?
b) What can be done manually to stem queries containing wildcards, e.g. overriding some parser?
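
To make question b) concrete, here is a rough sketch of the behavior we have in mind, again in plain Java rather than against the Lucene API. The stem() method is the same hypothetical stand-in for the GermanStemmer, and rawPrefix()/stemmedPrefix() are made-up names; in a real parser the equivalent step would presumably sit wherever the prefix query is built.

```java
import java.util.Locale;

/** Toy comparison of raw vs. stemmed wildcard prefixes. Not Lucene code. */
final class StemmedPrefixDemo {

    // Hypothetical stand-in for GermanStemmer: lowercases the token, then
    // strips trailing "er"/"en" while at least three characters remain.
    static String stem(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        while ((t.endsWith("er") || t.endsWith("en")) && t.length() - 2 >= 3) {
            t = t.substring(0, t.length() - 2);
        }
        return t;
    }

    // Behavior as observed: the part before '*' is only lowercased.
    static String rawPrefix(String wildcardQuery) {
        return wildcardQuery.substring(0, wildcardQuery.length() - 1)
                            .toLowerCase(Locale.ROOT);
    }

    // Behavior we would like to configure: the part before '*' is stemmed
    // with the same stemmer that was used at index time.
    static String stemmedPrefix(String wildcardQuery) {
        return stem(wildcardQuery.substring(0, wildcardQuery.length() - 1));
    }

    public static void main(String[] args) {
        String indexed = stem("Hersener"); // "hers"
        String query = "Hersen*";
        System.out.println("raw prefix matches:     "
                + indexed.startsWith(rawPrefix(query)));
        System.out.println("stemmed prefix matches: "
                + indexed.startsWith(stemmedPrefix(query)));
    }
}
```

The trade-off is that stemming an incomplete word can shorten the prefix more than the user intended, so such a query may return a few additional items.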

Best regards
Dennis




