lucene-java-user mailing list archives

From: markharw00d <markharw...@yahoo.co.uk>
Subject: Re: Getting irrelevant results using fuzzy query
Date: Wed, 18 Jun 2008 20:09:02 GMT

This looks like it is related to an issue I first raised here:
    http://markmail.org/message/37ywsemfudpos6uh

At the time I identified two issues with FuzzyQuery: the usual "coord"
and "idf" scoring factors shouldn't be applied to fuzzy queries.
The coord factor got fixed, but idf remains an issue in FuzzyQuery, I
believe. There is a class I added to contrib/queries, "FuzzyLikeThisQuery",
which can be used as a replacement for FuzzyQuery and fixes the idf
issue. You can subclass the QueryParser to create an instance of that
rather than FuzzyQuery if required.
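
To illustrate why idf works against you here, a rough sketch follows. It
assumes the classic DefaultSimilarity formula, idf = ln(numDocs / (docFreq + 1)) + 1.
The class name and the document frequencies are invented for illustration;
only the index size comes from the thread.

```java
// Sketch only: reproduces the classic idf formula to show that a rare
// fuzzy variant gets a larger idf factor than a common exact match.
final class IdfSketch {

    // Assumed formula: idf = ln(numDocs / (docFreq + 1)) + 1
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int numDocs = 1_500_000;   // index size mentioned in the thread
        int exactDocFreq = 200;    // hypothetical: docs whose artist is "coldplay"
        int variantDocFreq = 1;    // hypothetical: docs whose artist is "longplay"

        System.out.println("idf(exact term)   = " + idf(exactDocFreq, numDocs));
        System.out.println("idf(rare variant) = " + idf(variantDocFreq, numDocs));
    }
}
```

The rarer the expanded term, the larger its idf, so in a big index an
obscure near-miss can be pushed up relative to the term the user actually
typed. That is the effect FuzzyLikeThisQuery is meant to avoid.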

Cheers,
Mark

László Monda wrote:
> Hi List,
>
> I've been redirected from general@lucene.apache.org to here to discuss
> my issue.
>
>
> ---------- My original email ----------
>
>
> I'm trying to provide relevant results for the users of a lyrics site,
> even in the case of misspellings, by indexing artists and songs with Lucene.
>
> The problem is that Lucene provides irrelevant search results.  For
> example searching for "Coldplay" returns "Longplay" as the most relevant
> result.
>
> This is how I create individual documents:
>
> Document document = new Document();
> document.add(new Field("artist", artist, Field.Store.YES,
>         Field.Index.UN_TOKENIZED));
> document.add(new Field("song", song, Field.Store.YES,
>         Field.Index.UN_TOKENIZED));
> document.add(new Field("path", path, Field.Store.YES, Field.Index.NO));
> indexWriter.addDocument(document);
>
> And this is how I compose the actual query:
>
> BooleanQuery query = new BooleanQuery();
> if (artist.length() > 0) {
>     FuzzyQuery artist_query = new FuzzyQuery(new Term("artist", artist));
>     query.add(artist_query, BooleanClause.Occur.MUST);
> }
> if (song.length() > 0) {
>     FuzzyQuery song_query = new FuzzyQuery(new Term("song", song));
>     query.add(song_query, BooleanClause.Occur.MUST);
> }
>
> Please let me know what's wrong, I'd like to make this work right.
>
> Thanks in advance!
>
>
> ---------- My reply to an answer ----------
>
>
> On Tue, 2008-06-17 at 20:38 +0200, Daniel Naber wrote:
>> On Tuesday, 17 June 2008, László Monda wrote:
>>
>>>     FuzzyQuery artist_query = new FuzzyQuery(new Term("artist",
>>> artist));
>>
>> You should try the FuzzyQuery constructor that takes a minimum
>> similarity and a prefix length. The general problem, however, is that
>> the degree of similarity is only one factor. The other factors are the
>> same as for other searches, e.g. the number of occurrences of the term
>> in the document and in the whole index.
>>
>> You could try to write your own similarity implementation that
>> disables all these factors, see
>>
>> http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Similarity.html
>
> I understand some essential concepts related to Lucene such as the
> Levenshtein distance and tokenization, but I really don't want to go
> this deep if it's not necessary.
>
> Since fuzzy searching is based on the Levenshtein distance, the distance
> between "coldplay" and "coldplay" is 0 and the distance between
> "coldplay" and "downplay" is 3, so how on earth is it possible that when
> searching for "coldplay", Lucene returns "longplay"?  This shouldn't
> happen regardless of the minimum similarity and prefix length factors.
>
> Additional info: Lucene seems to do the right thing when only a few
> documents are present, but goes crazy when there are about 1.5 million
> documents in the index.
>
>
> ---------------------------------------------------------------------
>
>
> I hope that some of you can help me because I have no idea what
> could be wrong here.
>
> Thanks in advance!
>
>   
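
On the Levenshtein question in the quoted message: a small sketch, assuming
FuzzyQuery's similarity is 1 - distance / min(length of query term, length of
indexed term) with a prefix length of 0 and a default minimum similarity of
0.5, as I understand the Lucene 2.x behaviour. The class and method names are
made up for illustration.

```java
// Sketch only: plain Levenshtein distance plus the similarity measure
// assumed above, to show why "longplay" is accepted as a fuzzy match.
final class FuzzySimilaritySketch {

    // Classic two-row dynamic-programming edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed measure: 1 - distance / length of the shorter string.
    static double similarity(String query, String candidate) {
        int shorter = Math.min(query.length(), candidate.length());
        if (shorter == 0) {
            return query.length() == candidate.length() ? 1.0 : 0.0;
        }
        return 1.0 - (double) levenshtein(query, candidate) / shorter;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("coldplay", "longplay")); // 3
        System.out.println(similarity("coldplay", "longplay"));  // 0.625
    }
}
```

With a distance of 3 over 8 characters the similarity is 0.625, which is
above the assumed 0.5 threshold, so "longplay" is a legitimate match. Whether
it then outranks "coldplay" is down to the other scoring factors, not the
edit distance.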




