lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Stemming and Wildcard - or fire and water
Date Tue, 11 Dec 2012 10:04:07 GMT
This is a well-known problem: Wildcards cannot be analyzed by the query parser, because the
analysis would destroy the wildcard characters; also stemming of parts of terms will never
work. For Solr there is a workaround (MultiTermAware component), but it is also very limited
and only works when all analysis components are MultiTermAware, what stemmers are not.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Bayer Dennis [mailto:Dennis.Bayer@cursor.de]
> Sent: Tuesday, December 11, 2012 10:50 AM
> To: java-user@lucene.apache.org
> Subject: Stemming and Wildcard - or fire and water
> 
> Hello there,
> my colleague and I ran into an example which didn't return the result size
> which we were expecting. We discovered that there is a mismatch in
> handling terms while indexing and searching. This issue is already discussed
> several times in the internet as we found out later on, but in our point of
> view it's a buggy behavior if, at least, using a German stemmer.
> 
> Tl;dr: a Junit testcase is available (http://pastebin.com/AdeFdW1k)
> 
> Setup:
> * Lucene 4.0.0
> * Use the GermanAnalyzer which internally uses a GermanStemmer
> 
> Issue:
> * Create an index for "Hersener" which has a common ending in German ->
> the string is shortend to "hers"
> * Search for "Hers" -> a result is found
> * Search for "Hersen" -> a result is found because the input token is also
> stemmed to "hers"
> * Search for "Hers*" -> a result is found
> * Search for "Hersen*" -> nothing is found because the analyzer does not
> run
> 
> Similiar examples can be constructed easily if umlauts are involved.
> 
> Conclusion:
> The search query which contains a wildcard should also be run through the
> analyzer, because there are a lot of queries which would return nothing. The
> lucene FAQ already as a topic related to this issue:
> http://wiki.apache.org/lucene-
> java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sen
> sitive.3F
> 
> The example with "dog" and "dogs" works as long as only one character is
> stemmed - which could be true in English for the majority. But if more
> characters are involved lucene does not return anything instead of returning
> a few additional items. Just consider "families" which is stemmed to "famili".
> Searching for "familie*" wouldn't return no item.
> 
> To find an ending for this initial post ;) :
> Could this behavior made configurable in the standard? If not:
> a) Why are the stemmers used by default if they can led to wrong results?
> b) What can be done manually to stem queries containing wildcards, e.g.
> overriding some parser.
> 
> Best regards
> Dennis
> 
> 
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message