lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From László Monda <l...@monda.hu>
Subject Re: Getting irrelevant results using fuzzy query
Date Mon, 23 Jun 2008 11:21:29 GMT
Hi Mark,

On Wed, 2008-06-18 at 21:09 +0100, markharw00d wrote:
> This looks like it is related to an issue I first raised here:
>     http://markmail.org/message/37ywsemfudpos6uh
> 
> At the time I identified 2 issues with FuzzyQuery - that the usual 
> "coord" and "idf" scoring factors shouldn't be applied to fuzzy queries.
> The coord factor got fixed but idf remains an issue in FuzzyQuery I beleive.
> There is a class I added to contrib/queries - "FuzzyLikeThisQuery", 
> which can be used as a replacement to FuzzyQuery and fixes the idf 
> issue. You can subclass the QueryParser to create an instance of that 
> rather than FuzzyQuery if required.

Frankly, I neighter understand the deeper anatomy of Lucene, nor do I
want to.  I'd expect Lucene to do the right thing with fuzzy queries and
it obviously doesn't do the right thing in my case.

I get clearly irrelevant results when I search for a word using fuzzy
query, that world is stored in my Lucene index and Lucene returns me a
word that is pretty different.  It seems to me a bug.

I've downloaded FuzzyQuery from SVN and replaced the references of the
Fuzzy class with it.  I got the same irrelevant results than before.

Do you have any ideas what's going on here?

> 
> Cheers,
> Mark
> 
> László Monda wrote:
> > Hi List,
> >
> > I've been redirected from general@lucene.apache.org to here to discuss
> > my issue.
> >
> >
> > ---------- My original email ----------
> >
> >
> > I try to provide relevant results for the users of a lyrics site, even
> > in the case of misspellings by indexing artist and songs with Lucene.
> >
> > The problem is that Lucene provides irrelevant search results.  For
> > example searching for "Coldplay" returns "Longplay" as the most relevant
> > result.
> >
> > This is how I create individual documents:
> >
> > Document document = new Document();
> > document.add(new Field("artist", artist, Field.Store.YES,
> > Field.Index.UN_TOKENIZED));
> > document.add(new Field("song", song, Field.Store.YES,
> > Field.Index.UN_TOKENIZED));
> > document.add(new Field("path", path, Field.Store.YES, Field.Index.NO));
> > indexWriter.addDocument(document);
> >
> > And this is how I compose the actual query:
> >
> > BooleanQuery query = new BooleanQuery();
> > if (artist.length() > 0) {
> >     FuzzyQuery artist_query = new FuzzyQuery(new Term("artist",
> > artist));
> >     query.add(artist_query, BooleanClause.Occur.MUST);
> > }
> > if (song.length() > 0) {
> >     FuzzyQuery song_query = new FuzzyQuery(new Term("song", song));
> >     query.add(song_query, BooleanClause.Occur.MUST);
> > }
> >
> > Please let me know what's wrong, I'd like to make this work right.
> >
> > Thanks in advance!
> >
> >
> > ---------- My reply to an answer ----------
> >
> >
> > On Tue, 2008-06-17 at 20:38 +0200, Daniel Naber wrote:
> >   
> >> On Dienstag, 17. Juni 2008, László Monda wrote:
> >>
> >>     
> >>>     FuzzyQuery artist_query = new FuzzyQuery(new Term("artist",
> >>> artist));
> >>>       
> >> You should try the FuzzyQuery constructor that takes a minimum
> >>     
> > similarity 
> >   
> >> and a prefix length. The general problem is however, that the degree
> >>     
> > of 
> >   
> >> similarity is only one factor. The other factors are the same as for
> >>     
> > other 
> >   
> >> searches, e.g. the number of occurences of the term in the document
> >>     
> > and in 
> >   
> >> the whole index.
> >>
> >> You could try to write your own similarity implementation that
> >>     
> > disables all 
> >   
> >> these factors, see
> >>
> >>     
> > http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Similarity.html

> >
> > I understand some essential concepts related to Lucene such as the
> > Levenshtein distance and tokenization, but I really don't want to go
> > this deep if it's not necessary.
> >
> > Since fuzzy searching is based on the Levenshtein distance, the distance
> > between "coldplay" and "coldplay" is 0 and the distance between
> > "coldplay" and "downplay" is 3 so how on earth is possible that when
> > searching for "coldplay", Lucene returns "longplay"?  This shouldn't
> > happen regardless of the minimum similarity and prefix length factors.
> >
> > Additional info: Lucene seems to do the right thing when only few
> > documents are present, but goes crazy when there is about 1.5 million
> > documents in the index.
> >
> >
> > ---------------------------------------------------------------------
> >
> >
> > I hope that some of you can help me because I don't have any ideas what
> > can be wrong here.
> >
> > Thanks in advance!
> >
> >   
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
-- 
Laci  <http://monda.hu>


Mime
View raw message