From: Benjamin Patrick Jung
Date: Mon, 29 Mar 2010 16:57:44 +0200
Subject: Problem / question concerning "Fuzzy Search"

Hi all,

I tried to figure out how the fuzzy search implementation in Apache Lucene works and I'm kinda stuck here.
--> Version : Apache Lucene 3.0.1 (JAVA)

[What I want / need]
I'm looking for a way to combine a prefix, fuzzy and wildcard query.
Q: Is it possible to have a query like "user_input~0.5*"?

[JavaDoc for org.apache.lucene.search.FuzzyQuery c-tor]
@param minimumSimilarity: a value between 0 and 1 to set the required similarity between the query term and the matching terms. For example, for a minimumSimilarity of 0.5 a term of the same length as the query term is considered similar to the query term if the edit distance between both terms is less than length(term)*0.5
Q: Hm... what if the query term differs in length from the term in my document?

[Test case]
I have written a small test program (JUnit test case) to explain my problem / confusion in detail:
--> http://eugeneciurana.com/pastebin/pastebin.php?show=42619

[Examples]
Search term     --> Subset of expected result
Cinamo~0.5      --> Cinema, Cinnamon    [works]
Strawbarr~0.8   --> Strawberry          [doesn't work]

--> As far as I understand, the "edit distance" (aka "Levenshtein distance") between "Strawbarr" and "Strawberry" is 2 (one replacement and one insertion to transform "Strawbarr" into "Strawberry").

The query "Strawbarr~0.8" in my opinion (and from what I read in the JavaDocs) should work just fine, because
len(Strawbarr)*0.8 == 9*0.8 == 7.2 ... 7.2 >= 2 ... still -- it doesn't work.
Is that because the length of the search term and the word in my document differ?

I already searched the wiki and the mailing list archive, and had a look in all the "obvious" places, but had no luck so far. If I am missing something obvious here, I would be glad to receive some pointers in the right direction.
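To make the arithmetic above easier to check, here is a minimal, self-contained sketch that computes the edit distance for the three example pairs and turns it into a score. The scoring rule used here, 1 - distance / min(length), is an assumption about how the 3.0.x fuzzy matching normalizes the distance (the prefix length is taken as 0); it is not something the JavaDoc quoted above states. The class and method names are hypothetical, and nothing in it calls Lucene itself.

```java
// Sketch only: plain Java, no Lucene classes involved.
class FuzzySimilaritySketch {

    // Classic dynamic-programming Levenshtein distance
    // (insertions, deletions and substitutions each cost 1).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // ASSUMPTION: the score is normalized by the length of the SHORTER term,
    // i.e. similarity = 1 - distance / min(len(query), len(term)).
    static double similarity(String query, String term) {
        int shorter = Math.min(query.length(), term.length());
        if (shorter == 0) {
            return 0.0; // avoid division by zero for empty input
        }
        return 1.0 - (double) editDistance(query, term) / shorter;
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"Cinamo", "Cinema"},
            {"Cinamo", "Cinnamon"},
            {"Strawbarr", "Strawberry"},
        };
        for (String[] p : pairs) {
            System.out.printf("%s vs %s: distance=%d, similarity=%.3f%n",
                    p[0], p[1], editDistance(p[0], p[1]), similarity(p[0], p[1]));
        }
    }
}
```

If that assumption holds, "Strawbarr" vs "Strawberry" scores 1 - 2/9, roughly 0.78, which is just below the requested 0.8, while both "Cinamo" pairs score 1 - 2/6, roughly 0.67, which is above 0.5. That would mean the threshold is compared against the score itself rather than against len(term)*0.8.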
<--

Regards
-benjamin-

--
Benjamin Jung
Terreon, http://terreon.de/
Tel.: +49 (0)69 / 8484 65 37
Fax: +49 (0)6054 / 909 788 2
Mobil +49 (0)1577 / 159 788 3