From: Bayer Dennis
To: "java-user@lucene.apache.org"
Subject: Stemming and Wildcard - or fire and water
Date: Tue, 11 Dec 2012 09:49:44 +0000

Hello there,

my colleague and I ran into an example that did not return the number of results we were expecting. We discovered that there is a mismatch in how terms are handled during indexing and during searching.
This issue has already been discussed several times on the internet, as we found out later, but from our point of view it is buggy behavior, at least when a German stemmer is used.

TL;DR: a JUnit test case is available (http://pastebin.com/AdeFdW1k)

Setup:
* Lucene 4.0.0
* Use the GermanAnalyzer, which internally uses a GermanStemmer

Issue:
* Create an index for "Hersener", which has a common ending in German -> the string is shortened to "hers"
* Search for "Hers" -> a result is found
* Search for "Hersen" -> a result is found because the input token is also stemmed to "hers"
* Search for "Hers*" -> a result is found
* Search for "Hersen*" -> nothing is found because the analyzer does not run on the wildcard term

Similar examples can be constructed easily if umlauts are involved.

Conclusion:
A search query that contains a wildcard should also be run through the analyzer, because otherwise there are a lot of queries that return nothing. The Lucene FAQ already has a topic related to this issue:
http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F

The example with "dog" and "dogs" works as long as only one character is stemmed, which may be true for the majority of English words. But if more characters are involved, Lucene returns nothing instead of returning a few additional items. Just consider "families", which is stemmed to "famili". Searching for "familie*" would return no item.

To find an ending for this initial post ;) :
Could this behavior be made configurable in the standard? If not:
a) Why are the stemmers used by default if they can lead to wrong results?
b) What can be done manually to stem queries containing wildcards, e.g. overriding some parser?

Best regards
Dennis
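
PS: The mismatch can be reproduced without Lucene. The following is only a minimal sketch: toyStem is a made-up suffix stripper standing in for the GermanStemmer (it is not Lucene's algorithm), and the class and method names are hypothetical. It shows that a prefix compared unanalyzed against stemmed index terms misses "Hersen*", while stemming the prefix first finds the entry.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not Lucene code: models an index of stemmed terms
// and two ways of evaluating a prefix ("foo*") query against it.
final class StemWildcardSketch {

    // Assumption: a toy suffix list, chosen only so that "Hersener" -> "hers".
    private static final List<String> SUFFIXES = Arrays.asList("en", "er", "e");

    // Lowercases the word and repeatedly strips a known suffix
    // as long as at least four characters remain.
    static String toyStem(String word) {
        String term = word.toLowerCase(Locale.ROOT);
        boolean stripped = true;
        while (stripped) {
            stripped = false;
            for (String suffix : SUFFIXES) {
                if (term.endsWith(suffix) && term.length() - suffix.length() >= 4) {
                    term = term.substring(0, term.length() - suffix.length());
                    stripped = true;
                    break;
                }
            }
        }
        return term;
    }

    // Index time: every word is stemmed before it is stored.
    static Set<String> buildIndex(String... words) {
        Set<String> index = new LinkedHashSet<>();
        for (String word : words) {
            index.add(toyStem(word));
        }
        return index;
    }

    // Plain term query: the input is stemmed, as at index time.
    static boolean termQuery(Set<String> index, String input) {
        return index.contains(toyStem(input));
    }

    // Prefix query as described in the post: the prefix is only lowercased.
    static boolean rawPrefixQuery(Set<String> index, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        return index.stream().anyMatch(t -> t.startsWith(p));
    }

    // Possible workaround: stem the prefix before matching.
    static boolean stemmedPrefixQuery(Set<String> index, String prefix) {
        String p = toyStem(prefix);
        return index.stream().anyMatch(t -> t.startsWith(p));
    }

    public static void main(String[] args) {
        Set<String> index = buildIndex("Hersener");
        System.out.println("index terms:        " + index);
        System.out.println("Hersen              " + termQuery(index, "Hersen"));
        System.out.println("Hers*   (raw)       " + rawPrefixQuery(index, "Hers"));
        System.out.println("Hersen* (raw)       " + rawPrefixQuery(index, "Hersen"));
        System.out.println("Hersen* (stemmed)   " + stemmedPrefixQuery(index, "Hersen"));
    }
}
```

Stemming the prefix is not free of side effects: a partial word may be stemmed differently from the full word the user had in mind, so this variant can return additional hits. That is the trade-off asked about above (a few extra items instead of none).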