Subject: Getting irrelevant results using fuzzy query
From: László Monda
To: java-user@lucene.apache.org
Date: Wed, 18 Jun 2008 16:05:29 +0200

Hi List,

I've been redirected here from general@lucene.apache.org to discuss my issue.

---------- My original email ----------

I'm trying to provide relevant results for the users of a lyrics site, even
when they misspell a name, by indexing artists and songs with Lucene.

The problem is that Lucene returns irrelevant search results. For example,
searching for "Coldplay" returns "Longplay" as the most relevant result.

This is how I create individual documents:

    Document document = new Document();
    // artist and song are indexed as single, untokenized terms
    document.add(new Field("artist", artist, Field.Store.YES, Field.Index.UN_TOKENIZED));
    document.add(new Field("song", song, Field.Store.YES, Field.Index.UN_TOKENIZED));
    // path is stored only, not searchable
    document.add(new Field("path", path, Field.Store.YES, Field.Index.NO));
    indexWriter.addDocument(document);

And this is how I compose the actual query:

    BooleanQuery query = new BooleanQuery();
    if (artist.length() > 0) {
        FuzzyQuery artist_query = new FuzzyQuery(new Term("artist", artist));
        query.add(artist_query, BooleanClause.Occur.MUST);
    }
    if (song.length() > 0) {
        FuzzyQuery song_query = new FuzzyQuery(new Term("song", song));
        query.add(song_query, BooleanClause.Occur.MUST);
    }

Please let me know what's wrong; I'd like to make this work right.

Thanks in advance!
---------- My reply to an answer ----------

On Tue, 2008-06-17 at 20:38 +0200, Daniel Naber wrote:
> On Tuesday, 17 June 2008, László Monda wrote:
>
> > FuzzyQuery artist_query = new FuzzyQuery(new Term("artist",
> > artist));
>
> You should try the FuzzyQuery constructor that takes a minimum similarity
> and a prefix length. The general problem is however, that the degree of
> similarity is only one factor. The other factors are the same as for other
> searches, e.g. the number of occurrences of the term in the document and in
> the whole index.
>
> You could try to write your own similarity implementation that disables all
> these factors, see
> http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Similarity.html

I understand some essential concepts related to Lucene, such as the
Levenshtein distance and tokenization, but I really don't want to go this
deep if it's not necessary.

Since fuzzy searching is based on the Levenshtein distance, and the distance
between "coldplay" and "coldplay" is 0 while the distance between "coldplay"
and "longplay" is 3, how on earth is it possible that when searching for
"coldplay", Lucene returns "longplay" as the top result? This shouldn't
happen regardless of the minimum similarity and prefix length settings.

Additional info: Lucene seems to do the right thing when only a few documents
are present, but goes crazy when there are about 1.5 million documents in the
index.

---------------------------------------------------------------------

I hope that some of you can help me, because I have no idea what could be
wrong here.

Thanks in advance!

--
Laci
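
For reference, the edit-distance arithmetic in the message above can be checked with a small, Lucene-free sketch. The class and method names here are hypothetical, and the similarity formula (1 minus distance divided by the shorter term length, with a prefix length of 0) is an assumption about how a fuzzy query of this generation turns edit distance into a similarity score; the 0.5 threshold is likewise assumed to be the default minimum similarity. It is not Lucene's actual code.

```java
// Hypothetical, Lucene-free sketch of the edit-distance arithmetic discussed above.
class FuzzySimilaritySketch {

    // Classic dynamic-programming Levenshtein distance:
    // insertions, deletions and substitutions each cost 1.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // ASSUMPTION: similarity = 1 - distance / min(length), prefix length 0.
    static float similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1.0f : 0.0f;
        }
        return 1.0f - (float) levenshtein(a, b) / shorter;
    }

    public static void main(String[] args) {
        float assumedMinSimilarity = 0.5f; // ASSUMPTION: default threshold
        for (String candidate : new String[] {"coldplay", "longplay", "downplay"}) {
            int d = levenshtein("coldplay", candidate);
            float s = similarity("coldplay", candidate);
            System.out.println(candidate + ": distance=" + d + ", similarity=" + s
                    + ", matches=" + (s >= assumedMinSimilarity));
        }
    }
}
```

Under these assumptions, "longplay" and "downplay" both sit at distance 3 from "coldplay", giving a similarity of 0.625, which clears a 0.5 threshold. That would explain why such terms match at all; their final rank then also depends on the other scoring factors mentioned in the quoted reply, such as how often each term occurs in the index.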