lucene-general mailing list archives

From László Monda <l...@monda.hu>
Subject Re: Getting irrelevant results using fuzzy query
Date Tue, 17 Jun 2008 23:47:26 GMT
Hi Daniel,

On Tue, 2008-06-17 at 20:38 +0200, Daniel Naber wrote:
> On Dienstag, 17. Juni 2008, László Monda wrote:
> 
> >     FuzzyQuery artist_query = new FuzzyQuery(new Term("artist",
> > artist));
> 
> You should try the FuzzyQuery constructor that takes a minimum similarity 
> and a prefix length. The general problem is however, that the degree of 
> similarity is only one factor. The other factors are the same as for other 
> searches, e.g. the number of occurrences of the term in the document and in 
> the whole index.
> 
> You could try to write your own similarity implementation that disables all 
> these factors, see
> http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Similarity.html 

I understand some essential concepts related to Lucene such as the
Levenshtein distance and tokenization, but I really don't want to go
this deep if it's not necessary.

Since fuzzy searching is based on the Levenshtein distance, the distance
between "coldplay" and "coldplay" is 0 and the distance between
"coldplay" and "downplay" is 3, so how on earth is it possible that when
searching for "coldplay", Lucene returns "longplay"?  This shouldn't
happen regardless of the minimum similarity and prefix length factors.
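
To double-check the distances, here is a minimal sketch of the plain
Levenshtein computation in Java. It is not Lucene's own code; the class
name LevenshteinSketch and the method distance are made up for
illustration only.

```java
public class LevenshteinSketch {

    // Classic dynamic-programming edit distance (insert, delete,
    // substitute, each with cost 1), keeping only two rows of the table.
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the arrays: the current row becomes the previous one.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String query = "coldplay";
        String[] candidates = { "coldplay", "downplay", "longplay" };
        for (String candidate : candidates) {
            System.out.println(query + " -> " + candidate + ": "
                    + distance(query, candidate));
        }
    }
}
```

Running main prints the edit distance from "coldplay" to each of the
three words mentioned in this thread.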

Additional info: Lucene seems to do the right thing when only a few
documents are present, but goes crazy when there are about 1.5 million
documents in the index.

> BTW, In general, there's more traffic on the java-user list and you might 
> get more answers there.

Thanks for the suggestion, I might try java-user later.

> Regards
>  Daniel
> 
-- 
Laci  <http://monda.hu>

