From: "Uwe Schindler"
Reply-To: java-user@lucene.apache.org
Subject: RE: Strange Fuzzyquery results scoring when using a low minimal distance
Date: Tue, 16 Feb 2010 10:49:42 +0100

The problem is the following: the docFreq of the term "lucéne" is 2, while all other terms have a docFreq of 1 (because StandardAnalyzer lowercases everything, "Lucéne" and "lucéne" are indexed as the same term). Terms with a lower docFreq get a higher score in TermQuery. This score outweighs the boosting done by FuzzyQuery (because your index is so small).

If you raise the minSimilarity a little, your query matches fewer terms and the rewritten BooleanQuery contains fewer clauses. At some point the extra weight of the less frequent terms is no longer relevant for the final score.

By the way, you can always look at the explain() results, which inform you about the scoring done. The fix (it applies only to trunk, see issue https://issues.apache.org/jira/browse/LUCENE-124) is to ignore the scoring of the TermQueries generated by FuzzyQuery and only look at the edit distance (implemented by another MTQ.RewriteMode); this can be set with FuzzyQuery.setRewriteMode().

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: stefcl [mailto:stefatwork@gmail.com]
> Sent: Tuesday, February 16, 2010 10:11 AM
> To: java-user@lucene.apache.org
> Subject: Re: Strange Fuzzyquery results scoring when using a low minimal distance
>
> Thanks a lot,
> But I still don't understand why raising the min similarity a little bit changes the ordering...
>
> markharw00d wrote:
> >
> > This could be down to IDF, i.e. "Lucane" is ranked higher because it is rarer, despite having a worse edit distance.
> > This is arguably a bug.
> > See http://issues.apache.org/jira/browse/LUCENE-329 which discusses this.
> > You could try subclassing QueryParser and overriding newFuzzyQuery to return FuzzyLikeThisQuery (found in "contrib/queries").
> >
> > Cheers
> > Mark
> >
> > ----- Original Message ----
> > From: stefcl
> > To: java-user@lucene.apache.org
> > Sent: Mon, 15 February, 2010 14:13:52
> > Subject: Strange Fuzzyquery results scoring when using a low minimal distance
> >
> > Hello,
> >
> > I'm using Lucene v3.
> > Please consider the following spellings:
> >
> > Lucene
> > Lucéne
> > lucéne
> > Lucane
> > Lucen
> >
> > When searching for "lucéne" among those words using a FuzzyQuery (with a minimum similarity of 0.5), the results show:
> >
> > 1. Lucene 1.0259752
> > 2. Lucane 1.0259752
> > 3. Lucéne 0.95660806
> > 4. lucéne 0.95660806
> > 5. Lucen 0.30779266
> >
> > #4 is an exact match, so why does it receive a lower score than "Lucane", which contains one incorrect letter?
> >
> > Also, if you raise the min similarity a bit higher (0.6 or above), everything becomes normal:
> >
> > 1. Lucéne 1.0438477
> > 2. lucéne 1.0438477
> > 3. Lucene 0.97959816
> > 4. Lucane 0.97959816
> >
> > Any idea?
> > Thanks in advance...
> >
> > The code I use:
> >
> >     /**
> >      * @param args the command line arguments
> >      */
> >     public static void main(String[] args) throws IOException, ParseException
> >     {
> >         StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
> >
> >         // TODO code application logic here
> >         Directory index = new RAMDirectory();
> >         IndexWriter w = new IndexWriter(index, analyzer, true,
> >                 IndexWriter.MaxFieldLength.UNLIMITED);
> >
> >         addDoc(w, "Lucene");
> >         addDoc(w, "Lucéne");
> >         addDoc(w, "lucéne");
> >         addDoc(w, "Lucane");
> >         addDoc(w, "Lucen");
> >
> >         w.close();
> >
> >         FuzzyQuery q = new FuzzyQuery(new Term("title", "lucéne"), 0.5f);
> >
> >         // 3. search
> >         IndexSearcher searcher = new IndexSearcher(index);
> >
> >         TopDocs collector = searcher.search(q, 10);
> >         ScoreDoc[] hits = collector.scoreDocs;
> >
> >         // 4. display results
> >         System.out.println("Found " + hits.length + " hits.");
> >         for (int i = 0; i < hits.length; i++)
> >         {
> >             Document d = searcher.doc(hits[i].doc);
> >             System.out.println((i + 1) + ". " + d.get("title") + " " + hits[i].score);
> >         }
> >
> >         // searcher can only be closed when there
> >         // is no need to access the documents any more.
> >         searcher.close();
> >     }
> >
> >     private static void addDoc(IndexWriter w, String value) throws IOException
> >     {
> >         Document doc = new Document();
> >         doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
> >         w.addDocument(doc);
> >     }
> >
> > --
> > View this message in context: http://old.nabble.com/Strange-Fuzzyquery-results-scoring-when-using-a-low-minimal-distance-tp27594371p27594371.html
> > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>
> --
> View this message in context: http://old.nabble.com/Strange-Fuzzyquery-results-scoring-when-using-a-low-minimal-distance-tp27594371p27605395.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
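
The interplay between docFreq and the fuzzy boost described in the reply can be checked by hand. The following is a minimal sketch in plain Java (no Lucene dependency), not Lucene's actual code. It assumes the classic idf formula 1 + ln(numDocs / (docFreq + 1)), a fuzzy boost of (similarity - minSimilarity) / (1 - minSimilarity) with similarity = 1 - editDistance / min(length), tf = 1, a field norm of 1, and no coord factor. The class name FuzzyScoreSketch and all of its methods are hypothetical. Under those assumptions the computed scores should land close to the numbers reported in the thread for both 0.5 and 0.6.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: recomputes the thread's scores from assumed formulas.
class FuzzyScoreSketch {

    // Classic Levenshtein edit distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed idf formula: rarer terms (lower docFreq) get a larger value.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Assumed fuzzy boost; returns -1 when the term is not similar enough to match.
    static double fuzzyBoost(String query, String term, double minSim) {
        double sim = 1.0 - editDistance(query, term) / (double) Math.min(query.length(), term.length());
        return sim > minSim ? (sim - minSim) / (1.0 - minSim) : -1.0;
    }

    // Score per matching term: idf^2 * boost * queryNorm (tf and field norm taken as 1).
    static Map<String, Double> scores(String query, Map<String, Integer> docFreqs,
                                      int numDocs, double minSim) {
        Map<String, Double> raw = new LinkedHashMap<>();
        double sumOfSquares = 0.0;
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            double boost = fuzzyBoost(query, e.getKey(), minSim);
            if (boost < 0) continue; // term does not match at this minSim
            double idf = idf(e.getValue(), numDocs);
            raw.put(e.getKey(), idf * idf * boost);
            sumOfSquares += (idf * boost) * (idf * boost);
        }
        double queryNorm = 1.0 / Math.sqrt(sumOfSquares);
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : raw.entrySet()) {
            result.put(e.getKey(), e.getValue() * queryNorm);
        }
        return result;
    }

    // Lowercased terms from the thread's five documents with their docFreq.
    static Map<String, Integer> threadIndex() {
        Map<String, Integer> df = new LinkedHashMap<>();
        df.put("lucene", 1);
        df.put("luc\u00e9ne", 2); // "Lucéne" and "lucéne" collapse into one term
        df.put("lucane", 1);
        df.put("lucen", 1);
        return df;
    }

    public static void main(String[] args) {
        for (double minSim : new double[] {0.5, 0.6}) {
            System.out.println("minSimilarity = " + minSim);
            scores("luc\u00e9ne", threadIndex(), 5, minSim)
                .forEach((term, score) -> System.out.println("  " + term + " " + score));
        }
    }
}
```

In this model, dropping "lucen" at 0.6 and the smaller boost left for "lucene" and "lucane" are enough to flip the order: the exact match keeps a boost of 1.0 while the one-edit terms fall from about 0.67 to about 0.58, which no longer makes up for their higher idf.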