lucene-java-user mailing list archives

From Paul Taylor <paul_t...@fastmail.fm>
Subject Re: Performance improvements for fuzzy queries ?
Date Thu, 08 Mar 2012 22:01:46 GMT
On 03/02/2012 15:01, Paul Taylor wrote:
>
> Using Lucene 3.5, I created a query parser based on the dismax parser 
> but, in order to get matches on misspellings etc., I additionally do a 
> fuzzy search and a wildcard search.
>
> http://svn.musicbrainz.org/search_server/trunk/servlet/src/main/java/org/musicbrainz/search/servlet/DismaxQueryParser.java
>
> So a search for 'echo bunneymen' searches over three fields 
> (alias, sortname, artist) and becomes disjunction searches on these 
> plus a phrase search:
>
> custom(+((
> alias:echo~0.5^0.71999997 | alias:echo*^0.71999997 | alias:echo^0.9
> | sortname:echo~0.5^0.88000005 | sortname:echo*^0.88000005 | 
> sortname:echo^1.1
> | artist:echo~0.5^1.04 | artist:echo*^1.04 | artist:echo^1.3)~0.1
>  (
> alias:bunneymen~0.5^0.71999997 | alias:bunneymen*^0.71999997 | 
> alias:bunneymen^0.9
> | sortname:bunneymen~0.5^0.88000005 | sortname:bunneymen*^0.88000005 | 
> sortname:bunneymen^1.1
> | artist:bunneymen~0.5^1.04 | artist:bunneymen*^1.04 | 
> artist:bunneymen^1.3)~0.1)
>  (alias:"echo bunneymen"^0.2 | sortname:"echo bunneymen"^0.2 | 
> artist:"echo bunneymen"^0.2)~0.1)
>
> and it gives me exactly the results and scoring that I want; the 
> trouble is that it's TOO SLOW.
>
> I tried using a different rewrite method, as recommended: new 
> MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite(100). But then it 
> doesn't consider the query idf, which makes sense so that rare query 
> terms aren't boosted. However, neither does it consider the idf or 
> field/norm of the matching document, which seems wrong because these 
> still seem relevant. More problematically, the fuzzy query scores 
> are so much lower than normal and phrase matches that it doesn't seem 
> to work when fuzzy queries are mixed in with other queries. Is there a 
> better option, or even some better documentation on the rewrite 
> method, so I can understand it better?
>
> Alternatively, is there an analyzer I can use to analyse the fields 
> using the fuzzy/Levenshtein logic, so I can do this at index time 
> and then just use a normal term query with the same analyzer instead 
> of a fuzzy query?
>
> Paul
>
FYI, it turns out the performance problems had more to do with the fact 
that I hadn't changed prefixLength from zero. Although I only ran fuzzy 
queries when the term was at least 4 characters long, I didn't realise 
that this alone wouldn't prevent the query term being matched against 
terms shorter than 4 characters; for that I also had to set the prefix 
length to four.
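
To illustrate why a non-zero prefix length helps, here is a minimal 
sketch in plain Java. It is not Lucene's implementation: the class 
PrefixFuzzySketch, its methods, and the sample term list are all 
hypothetical, and it uses a maximum edit count where Lucene 3.x uses a 
minimum similarity. As far as I know, in Lucene 3.x the prefix length is 
the third argument of the FuzzyQuery(Term, float, int) constructor. The 
idea is that only terms sharing the first prefixLength characters with 
the query are compared by edit distance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not how Lucene enumerates terms internally.
class PrefixFuzzySketch {

    /** Classic dynamic-programming Levenshtein edit distance. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns the terms that share the first prefixLength characters with
     * the query and are within maxEdits of it. comparisons[0] is incremented
     * once per edit-distance computation, to show how much work was done.
     */
    static List<String> fuzzyMatch(List<String> terms, String query,
                                   int prefixLength, int maxEdits,
                                   int[] comparisons) {
        String prefix = query.substring(0, Math.min(prefixLength, query.length()));
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (!term.startsWith(prefix)) {
                continue; // cheap rejection, no edit distance needed
            }
            comparisons[0]++;
            if (editDistance(query, term) <= maxEdits) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up term dictionary for the example.
        List<String> terms = Arrays.asList(
            "echo", "bunnymen", "bunneymen", "bun", "bunker", "sunnymen", "ban");
        for (int prefixLength : new int[] {0, 4}) {
            int[] comparisons = new int[1];
            List<String> hits = fuzzyMatch(terms, "bunneymen", prefixLength, 2, comparisons);
            System.out.println("prefixLength=" + prefixLength
                + " comparisons=" + comparisons[0] + " hits=" + hits);
        }
    }
}
```

With a prefix length of zero every term in the list is compared; with a 
prefix length of four only the terms starting with "bunn" are. The cost 
is that a misspelling in the first four characters (here "sunnymen") no 
longer matches.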

But interestingly, I just came across 
http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html 
so I'm looking forward to the 4.0 release, whenever that happens.


Paul
