Subject: Re: Getting irrelevant results using fuzzy query
From: László Monda
To: markharw00d@yahoo.co.uk
Cc: java-user@lucene.apache.org
Date: Sat, 28 Jun 2008 17:26:46 +0200

On Mon, 2008-06-23 at 12:52 +0000, mark harwood wrote:
> >>Could you tell me what's wrong here, please?
>
> There are potentially a number of factors at play here.
>
> Your use of FuzzyLikeThis is fine - just tried the code on my single-term "Paul" query and as I outlined before it is doing a much better job of matching (Paul~= results Paul,Paul,Paul....Phul rather than FuzzyQuery's Paul~= results Phul, Saul, Paulo , Paul, Paul.....)
>
> Try the query on just the term artist:Coldplay and see the results. What artists Does FuzzyLikeThis return vs FuzzyQuery?
>
> If you aren't getting Coldplay as the first result from FuzzyLikeThis double check the content is indexed using the same analyzer that you pass to FuzzyLikeThisQuery (your code below uses SimpleAnalyzer). If you indexed with WhitespaceAnalyzer for example or as "UN_TOKENIZED the index and the query differ so "Coldplay"!=coldplay.
>
> I notice the song title in your original code is treated as a single term in your query - is that how it is indexed?
> I can see that artist might possibly make sense as a single term which gets fuzzy matched but song titles are generally longer which means it may work better as a tokenized field.

You were right, tokenization was the issue. Using TOKENIZED instead of
UN_TOKENIZED immediately provided relevant results, even when using it
with FuzzyQuery. Using FuzzyLikeThisQuery made the relevance much
better, so I'm really happy with the results.

Thank you very much!

>
> Cheers
> Mark
>
>
> ----- Original Message ----
> From: László Monda
> To: java-user@lucene.apache.org
> Cc: markharw00d@yahoo.co.uk
> Sent: Monday, 23 June, 2008 1:11:50 PM
> Subject: Re: Getting irrelevant results using fuzzy query
>
> Thanks for your reply, Mark.
>
> This was my original code for constructing my query using FuzzyQuery:
>
> BooleanQuery query = new BooleanQuery();
> if (artist.length() > 0) {
>     FuzzyQuery artist_query = new FuzzyQuery(new Term("artist", artist));
>     query.add(artist_query, BooleanClause.Occur.MUST);
> }
> if (song.length() > 0) {
>     FuzzyQuery song_query = new FuzzyQuery(new Term("song", song));
>     query.add(song_query, BooleanClause.Occur.MUST);
> }
>
> This is my first attempt to use FuzzyLikeThisQuery (with no success):
>
> FuzzyLikeThisQuery query = new FuzzyLikeThisQuery(2, new SimpleAnalyzer());
> if (artist.length() > 0) {
>     query.addTerms(artist, "artist", 0.5f, 0);
> }
> if (song.length() > 0) {
>     query.addTerms(song, "song", 0.5f, 0);
> }
>
> This is my second attempt to use FuzzyLikeThisQuery (with no success):
>
> BooleanQuery query = new BooleanQuery();
> if (artist.length() > 0) {
>     FuzzyLikeThisQuery artist_query = new FuzzyLikeThisQuery(1, new SimpleAnalyzer());
>     artist_query.addTerms(artist, "artist", 0.5f, 0);
>     query.add(artist_query, BooleanClause.Occur.MUST);
> }
> if (song.length() > 0) {
>     FuzzyLikeThisQuery song_query = new FuzzyLikeThisQuery(1, new SimpleAnalyzer());
>     song_query.addTerms(song, "song", 0.5f, 0);
>     query.add(song_query, BooleanClause.Occur.MUST);
> }
>
> I think it's my lack of understanding of the usage of FuzzyLikeThisQuery
> that makes me get irrelevant results.
>
> Could you tell me what's wrong here, please?
>
> Thank you.
>
> On Mon, 2008-06-23 at 11:28 +0000, mark harwood wrote:
> > >>I do have serious problems with the relevance of the results with fuzzy queries.
> >
> > Please take the time to read my response here:
> >
> > http://www.gossamer-threads.com/lists/lucene/java-user/62050#62050
> >
> > I had a work colleague come up with exactly the same problem this week and the solution is the same.
> >
> > Just tested my index with a standard Lucene FuzzyQuery for "Paul~" - this gives "Phul", "Saul", and "Paulo" before ANY "Paul" records due to IDF issues.
> > Using FuzzyLikeThisQuery puts all the "Paul" records ahead of the variants.
> >
> > ----- Original Message ----
> > From: László Monda
> > To: java-user@lucene.apache.org
> > Cc: lucenelist2007@danielnaber.de
> > Sent: Monday, 23 June, 2008 12:10:05 PM
> > Subject: Re: Getting irrelevant results using fuzzy query
> >
> > On Wed, 2008-06-18 at 21:10 +0200, Daniel Naber wrote:
> > > On Wednesday, 18 June 2008, László Monda wrote:
> > >
> > > > Additional info: Lucene seems to do the right thing when only a few
> > > > documents are present, but goes crazy when there are about 1.5 million
> > > > documents in the index.
> > >
> > > Lucene works well with more documents (currently using it with 9 million),
> > > but the fuzzy query requires iteration over all terms, which makes this
> > > query slow. This can be avoided by setting the prefixLength parameter of the
> > > FuzzyQuery constructor to 1 or 2. Or maybe you should use an n-gram index,
> > > see the spellchecker in the contrib area.
> >
> > Thanks for the suggestion, but I don't have any performance problems
> > yet, but I do have serious problems with the relevance of the results
> > with fuzzy queries.
> >

--
Laci
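
For anyone finding this thread in the archive, here is a rough, Lucene-free sketch of why the tokenization fix discussed above matters. Everything in it is an assumption made for illustration: the class name FuzzyTokenizationSketch, the example title "Viva la Vida", the tokenize() helper (a crude stand-in for a lowercasing analyzer, not SimpleAnalyzer itself), and the similarity formula (1 minus edit distance over the shorter length), which only approximates the idea of fuzzy term matching and is not Lucene's actual scoring code. The 0.5 threshold simply mirrors the 0.5f value that appears in the quoted code.

```java
import java.util.Locale;

// Hypothetical illustration only; none of these names come from Lucene.
final class FuzzyTokenizationSketch {

    // Classic Levenshtein edit distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed similarity: 1 - distance / shorter length, clamped at 0.
    static float similarity(String queryTerm, String indexedTerm) {
        int shorter = Math.min(queryTerm.length(), indexedTerm.length());
        if (shorter == 0) {
            return 0f;
        }
        return Math.max(0f, 1f - (float) editDistance(queryTerm, indexedTerm) / shorter);
    }

    // Crude analyzer stand-in: lowercase, then split on runs of non-letters.
    static String[] tokenize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("^[^\\p{L}]+", "");
        return cleaned.isEmpty() ? new String[0] : cleaned.split("[^\\p{L}]+");
    }

    // Best similarity of one query token against a set of indexed terms.
    static float bestSimilarity(String queryToken, String[] indexedTerms) {
        float best = 0f;
        for (String term : indexedTerms) {
            best = Math.max(best, similarity(queryToken, term));
        }
        return best;
    }

    public static void main(String[] args) {
        String title = "Viva la Vida";                 // hypothetical song title
        String[] untokenized = { title };              // whole field stored as one term
        String[] tokenized = tokenize(title);          // one term per word, lowercased
        for (String token : tokenize("viva la vda")) { // misspelled user query
            System.out.println(token
                    + " untokenized=" + bestSimilarity(token, untokenized)
                    + " tokenized=" + bestSimilarity(token, tokenized));
        }
    }
}
```

Under these assumptions, the misspelled token "vda" clears the 0.5 threshold against the tokenized term "vida" but scores 0 against the single untokenized term "Viva la Vida", which is consistent with the behaviour reported in this thread.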