Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (nike.apache.org: domain of Dennis.Bayer@cursor.de
 designates 212.60.157.140 as permitted sender)
From: Bayer Dennis <Dennis.Bayer@cursor.de>
To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
Subject: Stemming and Wildcard - or fire and water
Thread-Topic: Stemming and Wildcard - or fire and water
Thread-Index: Ac3Um0xbGSw0V/cQSXm1+2dh3Yy5Sw==
Date: Tue, 11 Dec 2012 09:49:44 +0000
Message-ID: <F603A0EB24D60B4C9F6AA35D69EFD33C0225DCD9@Hermes.cursor.de>
Accept-Language: de-DE, en-US
Content-Language: de-DE
Content-Type: multipart/alternative;
	boundary="_000_F603A0EB24D60B4C9F6AA35D69EFD33C0225DCD9Hermescursorde_"
MIME-Version: 1.0

--_000_F603A0EB24D60B4C9F6AA35D69EFD33C0225DCD9Hermescursorde_
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Hello there,
my colleague and I ran into an example that did not return the number of
results we expected. We discovered a mismatch in how terms are handled at
indexing time and at search time. As we found out later, this issue has
already been discussed several times on the internet, but in our view it
is a bug, at least when a German stemmer is used.

TL;DR: a JUnit test case is available (http://pastebin.com/AdeFdW1k)

Setup:
* Lucene 4.0.0
* Use the GermanAnalyzer which internally uses a GermanStemmer

Issue:
* Index "Hersener", which has a common German ending -> the string is
  shortened to "hers"
* Search for "Hers" -> a result is found
* Search for "Hersen" -> a result is found because the input token is also
  stemmed to "hers"
* Search for "Hers*" -> a result is found
* Search for "Hersen*" -> nothing is found because the analyzer is not run
  on wildcard terms
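
The four searches above can be modelled without Lucene. The following is
only a toy sketch in plain Java, under loud assumptions: stem() is a
made-up suffix stripper (not the GermanStemmer), the class and method
names are hypothetical, and the wildcard prefix is assumed to be
lowercased but never stemmed.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy model of the index/search mismatch. This is NOT Lucene code. */
class StemWildcardDemo {

    // Hypothetical stemmer: lowercases, then repeatedly strips "er"/"en"
    // while at least three characters remain.
    static String stem(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String suffix : new String[] {"er", "en"}) {
                if (t.endsWith(suffix) && t.length() - suffix.length() >= 3) {
                    t = t.substring(0, t.length() - suffix.length());
                    changed = true;
                }
            }
        }
        return t;
    }

    // Indexing runs every word through the stemmer.
    static Set<String> index(String... words) {
        Set<String> terms = new HashSet<>();
        for (String w : words) {
            terms.add(stem(w));
        }
        return terms;
    }

    // A plain term query is analyzed (stemmed) before the lookup.
    static boolean termQuery(Set<String> terms, String text) {
        return terms.contains(stem(text));
    }

    // A wildcard prefix is only lowercased here (assumption), not stemmed.
    static boolean prefixQuery(Set<String> terms, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String term : terms) {
            if (term.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> terms = index("Hersener");             // stored as "hers"
        System.out.println(termQuery(terms, "Hers"));      // true
        System.out.println(termQuery(terms, "Hersen"));    // true
        System.out.println(prefixQuery(terms, "Hers"));    // true
        System.out.println(prefixQuery(terms, "Hersen"));  // false
    }
}
```

The last call fails because no indexed term starts with "hersen"; the
only term in the toy index is "hers".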

Similar examples can be constructed easily if umlauts are involved.

Conclusion:
A search query that contains a wildcard should also be run through the
analyzer, because otherwise many queries return nothing. The Lucene FAQ
already has a topic related to this issue:
http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F

The example with "dog" and "dogs" works as long as only one character is
stemmed, which may hold for most English words. But if more characters
are involved, Lucene returns nothing instead of returning a few additional
items. Just consider "families", which is stemmed to "famili": searching
for "familie*" would return nothing.

To find an ending for this initial post ;) :
Could this behavior be made configurable in standard Lucene? If not:
a) Why are the stemmers used by default if they can lead to wrong results?
b) What can be done manually to stem queries containing wildcards, e.g. by
   overriding part of the query parser?

Best regards
Dennis


--_000_F603A0EB24D60B4C9F6AA35D69EFD33C0225DCD9Hermescursorde_--