lucene-java-user mailing list archives

From Lars-Erik Aabech <...@markedspartner.no>
Subject RE: Stemming and Wildcard - or fire and water
Date Tue, 11 Dec 2012 10:55:12 GMT
A possible workaround is to preprocess search terms that contain wildcards: stem the part
before the wildcard yourself and build a new search string from the result.
A search for hersen* would then be rewritten to hers* and return what you expect.
The downside, of course, is that you search for more than you specified.
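
A minimal sketch of that rewriting step, in plain Java without any Lucene calls. Everything in it is hypothetical: the class name WildcardStemRewriter, the rewrite method, and above all toyStem, which only strips a few suffixes so the example is self-contained. It is not the GermanStemmer; in a real setup you would pass in a function that runs the prefix through the same analyzer used at index time.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical helper: rewrites "prefix*" tokens to "stem(prefix)*".
class WildcardStemRewriter {

    // Toy stand-in for a real stemmer (assumption, NOT Lucene's GermanStemmer):
    // lowercases the term and strips one of a few common suffixes.
    static String toyStem(String term) {
        String t = term.toLowerCase(Locale.ROOT);
        for (String suffix : new String[] {"ener", "en", "er", "e"}) {
            // Keep at least three characters so short words are not emptied.
            if (t.endsWith(suffix) && t.length() - suffix.length() >= 3) {
                return t.substring(0, t.length() - suffix.length());
            }
        }
        return t;
    }

    // Stems only tokens of the form "prefix*"; everything else is copied as-is.
    static String rewrite(String query, UnaryOperator<String> stemmer) {
        StringBuilder out = new StringBuilder();
        for (String token : query.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            boolean trailingStar = token.length() > 1 && token.endsWith("*");
            String prefix = trailingStar ? token.substring(0, token.length() - 1) : "";
            // Leave tokens alone if the prefix itself contains wildcards.
            if (trailingStar && prefix.indexOf('*') < 0 && prefix.indexOf('?') < 0) {
                out.append(stemmer.apply(prefix)).append('*');
            } else {
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "hers*"
        System.out.println(rewrite("Hersen*", WildcardStemRewriter::toyStem));
    }
}
```

The rewritten string is then handed to the query parser in place of the user's input. Terms without a wildcard are left untouched here because the parser already analyzes those.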

Lars-Erik

> -----Original Message-----
> From: Bayer Dennis [mailto:Dennis.Bayer@cursor.de]
> Sent: Tuesday, December 11, 2012 10:50 AM
> To: java-user@lucene.apache.org
> Subject: Stemming and Wildcard - or fire and water
> 
> Hello there,
> my colleague and I ran into an example which didn't return the number
> of results we were expecting. We discovered that there is a mismatch
> in how terms are handled while indexing and while searching. As we
> found out later, this issue has already been discussed several times
> on the internet, but from our point of view it is buggy behavior, at
> least when using a German stemmer.
> 
> Tl;dr: a Junit testcase is available (http://pastebin.com/AdeFdW1k)
> 
> Setup:
> * Lucene 4.0.0
> * Use the GermanAnalyzer which internally uses a GermanStemmer
> 
> Issue:
> * Create an index for "Hersener", which has a common German ending
> -> the string is shortened to "hers"
> * Search for "Hers" -> a result is found
> * Search for "Hersen" -> a result is found because the input token is 
> also stemmed to "hers"
> * Search for "Hers*" -> a result is found
> * Search for "Hersen*" -> nothing is found because the analyzer does
> not run on wildcard terms, so the prefix is not stemmed to "hers"
> 
> Similar examples can be constructed easily if umlauts are involved.
> 
> Conclusion:
> A search query which contains a wildcard should also be run through
> the analyzer, because otherwise there are a lot of queries which
> return nothing. The Lucene FAQ already has a topic related to this issue:
> http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
> 
> The example with "dog" and "dogs" works as long as only one character
> is stemmed - which may be true for the majority of English words. But if
> more characters are involved, Lucene returns nothing at all instead
> of returning a few additional items. Just consider "families", which is
> stemmed to "famili". Searching for "familie*" would return no items.
> 
> To find an ending for this initial post ;) :
> Could this behavior be made configurable in the standard? If not:
> a) Why are the stemmers used by default if they can lead to wrong results?
> b) What can be done manually to stem queries containing wildcards, e.g.
> by overriding some parser?
> 
> Best regards
> Dennis
> 
> 
> 


