lucene-dev mailing list archives

From markharw00d <markharw...@yahoo.co.uk>
Subject Re: More fuzzy issues - encouraging bad spelling?
Date Thu, 23 Dec 2004 21:20:11 GMT
Thanks for the suggestions, Paul.

I've just tried a scheme that takes the maximum docFreq among the 
expanded terms and uses it as the docFreq for all of them in their idf 
calculations (giving a lower, shared IDF). I'm also still removing the 
coordination factor on the BooleanQuery that groups the term queries.
The results seem much more sensible than the existing way of handling 
fuzzy queries. Here are some example results:

Query: smith~
==============
New scheme top result: Smith Smith
New scheme top score: 1.0
Existing scheme top result: Smita Khurana
Existing scheme top score: 0.02


Query: pete~ smith~
==============
New scheme top result: Peter Smith
New scheme top score: 0.99
Existing scheme top result: Morrissey Pete
Existing scheme top score: 0.07

Query: David Harland~
==============
New scheme top result: David Harland
New scheme top score: 0.68
Existing scheme top result: David Burland
Existing scheme top score: 0.18
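
The shared-docFreq idea behind these results can be sketched in plain 
Java. This is only an illustration, not the actual patch: it assumes the 
classic DefaultSimilarity idf formula, ln(numDocs / (docFreq + 1)) + 1, 
and the class name, term list, and document counts are all made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SharedIdfSketch {

    // Assumed formula: the classic DefaultSimilarity idf,
    // ln(numDocs / (docFreq + 1)) + 1.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Largest docFreq among the expanded terms; every term shares this value.
    static int maxDocFreq(Map<String, Integer> docFreqs) {
        int max = 0;
        for (int df : docFreqs.values()) {
            max = Math.max(max, df);
        }
        return max;
    }

    public static void main(String[] args) {
        int numDocs = 100000; // hypothetical index size

        // Hypothetical expansion of "smith~" with made-up document frequencies.
        Map<String, Integer> expanded = new LinkedHashMap<>();
        expanded.put("smith", 2000);
        expanded.put("smyth", 40);
        expanded.put("smita", 3);

        int shared = maxDocFreq(expanded);
        for (Map.Entry<String, Integer> e : expanded.entrySet()) {
            float existing = idf(e.getValue(), numDocs); // per-term idf
            float proposed = idf(shared, numDocs);       // shared, lower idf
            System.out.printf("%s existing=%.3f shared=%.3f%n",
                    e.getKey(), existing, proposed);
        }
    }
}
```

With per-term idf, a rare misspelling gets a higher idf than the common 
correct spelling; with the shared value every variant gets the same idf.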


I've currently amended FuzzyQuery to create new subclasses of 
BooleanQuery and TermQuery which override the Similarity methods coord 
(for BooleanQuery) and idf (for TermQuery). The other multi-term queries 
will need the same treatment.
Does this sound like the best way to do this?
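
A rough sketch of the two overrides, using stand-in classes rather than 
the real Lucene types so that it runs on its own. BaseSimilarity and 
FuzzySimilarity are hypothetical names; the base formulas are assumed to 
match the classic defaults (coord = overlap / maxOverlap, idf as 
ln(numDocs / (docFreq + 1)) + 1). How the query subclasses would hand 
such an object to the scorer is not shown.

```java
public class FuzzySimilaritySketch {

    // Stand-in for the two Similarity methods discussed (hypothetical class).
    static class BaseSimilarity {
        float coord(int overlap, int maxOverlap) {
            return overlap / (float) maxOverlap;
        }

        float idf(int docFreq, int numDocs) {
            return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
        }
    }

    // Overrides both methods for the terms of one fuzzy expansion.
    static class FuzzySimilarity extends BaseSimilarity {
        private final int sharedDocFreq; // max docFreq of the expanded terms

        FuzzySimilarity(int sharedDocFreq) {
            this.sharedDocFreq = sharedDocFreq;
        }

        // No coordination factor: matching one variant is not penalised
        // for failing to match the other variants.
        @Override
        float coord(int overlap, int maxOverlap) {
            return 1.0f;
        }

        // Ignore the term's own docFreq and use the shared one instead.
        @Override
        float idf(int docFreq, int numDocs) {
            return super.idf(sharedDocFreq, numDocs);
        }
    }

    public static void main(String[] args) {
        BaseSimilarity existing = new BaseSimilarity();
        FuzzySimilarity proposed = new FuzzySimilarity(2000); // made-up value

        System.out.println("existing coord(1,3) = " + existing.coord(1, 3));
        System.out.println("proposed coord(1,3) = " + proposed.coord(1, 3));
        System.out.println("existing idf(3)     = " + existing.idf(3, 100000));
        System.out.println("proposed idf(3)     = " + proposed.idf(3, 100000));
    }
}
```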

Cheers
Mark




