lucene-java-user mailing list archives

From "Klaus Nesbigall" <LLG...@gmx.de>
Subject AW: RE: Stemming and Wildcard - or fire and water
Date Fri, 04 Jan 2013 17:06:18 GMT
I've run into the same problem and tried your workaround, but overriding the parser
did not do the job.

I do not understand why the stemming is done anyway.
Uwe wrote:
> This is a well-known problem: Wildcards cannot be analyzed by the query 
> parser, because the analysis would destroy the wildcard characters; 
> also stemming of parts of terms will never work. 
> ...

The current behavior doesn't work either.
The English word "families" will not be found if the user types the query familie*.
So why solve the problem by declaring one opinion right and the other wrong?
A simple flag that enables or suppresses the stemming would solve everyone's problem. Those who
need no change can keep the old behavior; everyone else can set the appropriate flag.
If this problem is so well known, there seems to be a need for a clean solution.
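
To make the idea of such a flag concrete, here is a minimal sketch in plain Java. Nothing in it is Lucene API: toyStem is a hypothetical stand-in for a real stemmer (it only handles a few English suffixes), prefixSearch and the stemWildcardTerms flag are illustrative names, and the "index" is just a list of single-word documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class StemFlagDemo {

    // Hypothetical toy stemmer, NOT a Lucene class. It lowercases the term and
    // handles just enough suffixes to reproduce the "families" example.
    public static String toyStem(String term) {
        String t = term.toLowerCase(Locale.ROOT);
        if (t.endsWith("ies")) return t.substring(0, t.length() - 3) + "i";
        if (t.endsWith("ie")) return t.substring(0, t.length() - 2) + "i";
        if (t.endsWith("s") && t.length() > 3) return t.substring(0, t.length() - 1);
        return t;
    }

    // Prefix search over single-word documents. Documents are always stemmed
    // ("index time"); the query prefix is stemmed only if the flag is set.
    public static List<String> prefixSearch(List<String> documents,
                                            String prefix,
                                            boolean stemWildcardTerms) {
        String p = stemWildcardTerms ? toyStem(prefix) : prefix.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String doc : documents) {
            if (toyStem(doc).startsWith(p)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("families", "dogs");
        System.out.println(prefixSearch(docs, "familie", false)); // prints []
        System.out.println(prefixSearch(docs, "familie", true));  // prints [families]
        System.out.println(prefixSearch(docs, "dog", false));     // prints [dogs]
    }
}
```

With the flag off, familie* finds nothing because the indexed term is "famili"; with the flag on, the prefix is stemmed to "famili" as well and "families" is found.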


> A possible workaround could be to modify search terms with wildcard 
> tokens by stemming them manually and creating a new search string.
> Searches for hersen* would be modified to hers* and return what you 
> expect.
> The downside, of course, is that you search for more than you specified.
> 
> Lars-Erik
> 
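The workaround quoted above can be sketched in plain Java, independent of the Lucene API. The toyStem function is a hypothetical stand-in for whatever stemmer the index analyzer uses; it is not GermanStemmer and only strips a few suffixes so that the examples from this thread work. The rewrite helper and the whitespace-based query splitting are likewise simplifying assumptions.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

public class WildcardStemRewrite {

    // Hypothetical toy stemmer, NOT Lucene's GermanStemmer. It lowercases the
    // term and strips one of a few suffixes if at least 3 characters remain.
    public static String toyStem(String term) {
        String t = term.toLowerCase(Locale.ROOT);
        for (String suffix : new String[] {"ener", "en", "er", "e"}) {
            if (t.endsWith(suffix) && t.length() - suffix.length() >= 3) {
                return t.substring(0, t.length() - suffix.length());
            }
        }
        return t;
    }

    // Rewrites every whitespace-separated term that ends in '*' by stemming
    // the part before the wildcard; all other terms are passed through as-is.
    public static String rewrite(String query, UnaryOperator<String> stemmer) {
        StringBuilder out = new StringBuilder();
        for (String term : query.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            if (term.length() > 1 && term.endsWith("*")) {
                String prefix = term.substring(0, term.length() - 1);
                out.append(stemmer.apply(prefix)).append('*');
            } else {
                out.append(term);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("Hersen*", WildcardStemRewrite::toyStem));    // prints hers*
        System.out.println(rewrite("Hers* Haus", WildcardStemRewrite::toyStem)); // prints hers* Haus
    }
}
```

The rewritten string would then be handed to the query parser as usual. As noted in the quoted mail, the stemmed prefix matches more than the user typed.
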
> -----Original Message-----
> From: Bayer Dennis [mailto:Dennis.Bayer@cursor.de]
> Sent: Tuesday, December 11, 2012 10:50 AM
> To: java-user@lucene.apache.org
> Subject: Stemming and Wildcard - or fire and water
> 
> Hello there,
> my colleague and I ran into an example which didn't return the number of 
> results we were expecting. We discovered that there is a mismatch 
> in how terms are handled during indexing and searching. As we found out 
> later, this issue has already been discussed several times on the internet, 
> but from our point of view it is buggy behavior, at least when using a German stemmer.
> 
> TL;DR: a JUnit test case is available (http://pastebin.com/AdeFdW1k)
> 
> Setup:
> * Lucene 4.0.0
> * Use the GermanAnalyzer which internally uses a GermanStemmer
> 
> Issue:
> * Create an index for "Hersener", which has a common German ending 
> -> the string is shortened to "hers"
> * Search for "Hers" -> a result is found
> * Search for "Hersen" -> a result is found because the input token is 
> also stemmed to "hers"
> * Search for "Hers*" -> a result is found
> * Search for "Hersen*" -> nothing is found because the analyzer does 
> not run
> 
> Similar examples can be constructed easily if umlauts are involved.
> 
> Conclusion:
> A search query which contains a wildcard should also be run through 
> the analyzer, because otherwise a lot of queries return 
> nothing. The Lucene FAQ already has a topic related to this issue:
> http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
> 
> The example with "dog" and "dogs" works as long as only one character 
> is stemmed, which may be true for the majority of English words. But if 
> more characters are involved, Lucene returns nothing instead 
> of returning a few additional items. Just consider "families", which is stemmed to "famili".
> Searching for "familie*" would return no items.
> 
> To find an ending for this initial post ;) :
> Could this behavior be made configurable in the standard? If not:
> a) Why are the stemmers used by default if they can lead to wrong results?
> b) What can be done manually to stem queries containing wildcards, e.g.
> by overriding some parser?
> 
> Best regards
> Dennis
> 
> 
> 


