lucene-java-user mailing list archives

From László Monda <l...@monda.hu>
Subject Re: Getting irrelevant results using fuzzy query
Date Sat, 28 Jun 2008 15:26:46 GMT
On Mon, 2008-06-23 at 12:52 +0000, mark harwood wrote:
> >>Could you tell me what's wrong here, please?
> 
> There are potentially a number of factors at play here.
> 
> Your use of FuzzyLikeThis is fine - I just tried the code on my single-term
> "Paul" query and, as I outlined before, it does a much better job of matching
> (Paul~ gives Paul, Paul, Paul ... Phul, rather than FuzzyQuery's Paul~, which
> gives Phul, Saul, Paulo, Paul, Paul ...).
> 
> Try the query on just the term artist:Coldplay and see the results. What
> artists does FuzzyLikeThis return vs FuzzyQuery?
> 
> If you aren't getting Coldplay as the first result from FuzzyLikeThis, double
> check that the content is indexed using the same analyzer that you pass to
> FuzzyLikeThisQuery (your code below uses SimpleAnalyzer). If you indexed with
> WhitespaceAnalyzer, for example, or as UN_TOKENIZED, the index and the query
> terms differ, so "Coldplay" != "coldplay".
> 
> I notice the song title in your original code is treated as a single term in
> your query - is that how it is indexed? I can see that artist might possibly
> make sense as a single term which gets fuzzy matched, but song titles are
> generally longer, which means it may work better as a tokenized field.

You were right, tokenization was the issue.  Using TOKENIZED instead of
UN_TOKENIZED immediately provided relevant results, even when using it
with FuzzyQuery.
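
For anyone reading this in the archive, one way to picture why tokenization
can matter for fuzzy matching is the plain-Java sketch below. It does not use
Lucene at all. The class name, the whitespace splitting, the lowercasing, and
the similarity formula (1 - edit distance / shorter length, checked against a
0.5 threshold) are simplifying assumptions made for illustration, not Lucene's
actual implementation.

```java
import java.util.Locale;

// Hypothetical illustration only: no Lucene classes are involved.
final class TokenizedFuzzySketch {

    // Plain Levenshtein edit distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed similarity: 1 - distance / length of the shorter string.
    static float similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1f : 0f;
        }
        return 1f - (float) editDistance(a, b) / shorter;
    }

    // "Untokenized" view: the whole field value is compared as one term.
    static boolean matchesWholeField(String query, String field, float minSim) {
        return similarity(query.toLowerCase(Locale.ROOT),
                          field.toLowerCase(Locale.ROOT)) >= minSim;
    }

    // "Tokenized" view: any query word may match any word of the field.
    static boolean matchesAnyToken(String query, String field, float minSim) {
        for (String q : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            for (String t : field.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (similarity(q, t) >= minSim) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesWholeField("yelow", "Yellow Submarine", 0.5f));
        System.out.println(matchesAnyToken("yelow", "Yellow Submarine", 0.5f));
    }
}
```

With the whole title treated as one term, a short misspelled query such as
"yelow" is many edits away from "Yellow Submarine", while the per-word
comparison finds "yellow" only one edit away.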

Using FuzzyLikeThisQuery made the relevance much better, so I'm really
happy with the results.
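
The IDF effect Mark describes further down in this thread can be illustrated
with some arithmetic. The sketch below is plain Java, not Lucene code. It
assumes the classic idf formula 1 + ln(numDocs / (docFreq + 1)), a score
roughly proportional to boost * idf^2, and made-up document frequencies for
"paul" and "phul". The shared-idf variant, where every variant borrows the idf
of the source term, is an assumption about how FuzzyLikeThisQuery behaves,
stated here only to illustrate the idea.

```java
// Hypothetical illustration only: numbers and names are invented.
final class FuzzyIdfSketch {

    // Assumed classic idf: rarer terms get a larger value.
    static double idf(long numDocs, long docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Assumed boost: rescales similarity from [minSim, 1] to [0, 1].
    static double boost(double similarity, double minSim) {
        return (similarity - minSim) / (1.0 - minSim);
    }

    // Each variant is weighted by its own idf, so rare typos can win.
    static double perTermIdfScore(double similarity, double minSim,
                                  long numDocs, long variantDocFreq) {
        double idf = idf(numDocs, variantDocFreq);
        return boost(similarity, minSim) * idf * idf;
    }

    // Every variant uses the idf of the source term, so only the boost differs.
    static double sharedIdfScore(double similarity, double minSim,
                                 long numDocs, long sourceDocFreq) {
        return perTermIdfScore(similarity, minSim, numDocs, sourceDocFreq);
    }

    public static void main(String[] args) {
        long numDocs = 1_500_000L;  // index size mentioned in the thread
        long paulDocFreq = 20_000L; // made-up frequency of the exact term
        long phulDocFreq = 3L;      // made-up frequency of a rare variant
        double minSim = 0.5;

        System.out.println("per-term idf, paul: "
                + perTermIdfScore(1.0, minSim, numDocs, paulDocFreq));
        System.out.println("per-term idf, phul: "
                + perTermIdfScore(0.75, minSim, numDocs, phulDocFreq));
        System.out.println("shared idf, paul:   "
                + sharedIdfScore(1.0, minSim, numDocs, paulDocFreq));
        System.out.println("shared idf, phul:   "
                + sharedIdfScore(0.75, minSim, numDocs, paulDocFreq));
    }
}
```

Under these assumptions the rare variant outscores the exact term when each
term carries its own idf, and the order flips once both share the source
term's idf.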

Thank you very much!

> 
> Cheers
> Mark
> 
> 
> ----- Original Message ----
> From: László Monda <laci@monda.hu>
> To: java-user@lucene.apache.org
> Cc: markharw00d@yahoo.co.uk
> Sent: Monday, 23 June, 2008 1:11:50 PM
> Subject: Re: Getting irrelevant results using fuzzy query
> 
> Thanks for your reply, Mark.
> 
> 
> 
> This was my original code for constructing my query using FuzzyQuery:
> 
> BooleanQuery query = new BooleanQuery();
> if (artist.length() > 0) {
>     FuzzyQuery artist_query = new FuzzyQuery(new Term("artist", artist));
>     query.add(artist_query, BooleanClause.Occur.MUST);
> }
> if (song.length() > 0) {
>     FuzzyQuery song_query = new FuzzyQuery(new Term("song", song));
>     query.add(song_query, BooleanClause.Occur.MUST);
> }
> 
> 
> 
> This is my first attempt to use FuzzyLikeThisQuery (with no success):
> 
> FuzzyLikeThisQuery query = new FuzzyLikeThisQuery(2, new SimpleAnalyzer());
> if (artist.length() > 0) {
>     query.addTerms(artist, "artist", 0.5f, 0);
> }
> if (song.length() > 0) {
>     query.addTerms(song, "song", 0.5f, 0);
> }
> 
> 
> 
> This is my second attempt to use FuzzyLikeThisQuery (with no success):
> 
> BooleanQuery query = new BooleanQuery();
> if (artist.length() > 0) {
>     FuzzyLikeThisQuery artist_query = new FuzzyLikeThisQuery(1, new SimpleAnalyzer());
>     artist_query.addTerms(artist, "artist", 0.5f, 0);
>     query.add(artist_query, BooleanClause.Occur.MUST);
> }
> if (song.length() > 0) {
>     FuzzyLikeThisQuery song_query = new FuzzyLikeThisQuery(1, new SimpleAnalyzer());
>     song_query.addTerms(song, "song", 0.5f, 0);
>     query.add(song_query, BooleanClause.Occur.MUST);
> }
> 
> 
> 
> I think it's my lack of understanding of how to use FuzzyLikeThisQuery
> that is giving me irrelevant results.
> 
> Could you tell me what's wrong here, please?
> 
> Thank you.
> 
> On Mon, 2008-06-23 at 11:28 +0000, mark harwood wrote:
> > >>I do have serious problems with the relevance of the results with fuzzy queries.
> > 
> > Please take the time to read my response here:
> > 
> >      http://www.gossamer-threads.com/lists/lucene/java-user/62050#62050
> > 
> > I had a work colleague come up with exactly the same problem this week and
> > the solution is the same.
> > 
> > Just tested my index with a standard Lucene FuzzyQuery for "Paul~" - this
> > gives "Phul", "Saul", and "Paulo" before ANY "Paul" records due to IDF issues.
> > Using FuzzyLikeThisQuery puts all the "Paul" records ahead of the variants.
> > 
> > 
> > 
> > ----- Original Message ----
> > From: László Monda <laci@monda.hu>
> > To: java-user@lucene.apache.org
> > Cc: lucenelist2007@danielnaber.de
> > Sent: Monday, 23 June, 2008 12:10:05 PM
> > Subject: Re: Getting irrelevant results using fuzzy query
> > 
> > On Wed, 2008-06-18 at 21:10 +0200, Daniel Naber wrote:
> > > On Mittwoch, 18. Juni 2008, László Monda wrote:
> > > 
> > > > Additional info: Lucene seems to do the right thing when only few
> > > > documents are present, but goes crazy when there is about 1.5 million
> > > > documents in the index.
> > > 
> > > Lucene works well with more documents (currently using it with 9 million),
> > > but the fuzzy query requires iteration over all terms, which makes this
> > > query slow. This can be avoided by setting the prefixLength parameter of the
> > > FuzzyQuery constructor to 1 or 2. Or maybe you should use an n-gram index;
> > > see the spellchecker in the contrib area.
> > 
> > Thanks for the suggestion, but I don't have any performance problems
> > yet, but I do have serious problems with the relevance of the results
> > with fuzzy queries.
> > 
-- 
Laci  <http://monda.hu>

