lucene-java-user mailing list archives

From Marc Hadfield <m...@animarc.com>
Subject Re: Funny results with Fuzzy
Date Tue, 25 Oct 2005 17:43:47 GMT


hello -

a fuzzy query related question:

have there been any implementations of "fuzzy" queries other than
edit-distance?  and/or modifications of edit-distance that penalize
common alternate spellings less? e.g. "couldn't" vs. "couldnt", where the
missing apostrophe would get a smaller penalty than a character mismatch.

i'm thinking specifically of the algorithms in the SecondString open 
source package:
http://secondstring.sourceforge.net/

how difficult do you think it would be to wrap an alternate algorithm
that provides a:
float score(String1, String2)
function?
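
to make the idea concrete, here is a minimal sketch of such a score
function in plain Java. it is not Lucene's FuzzyQuery code and not a
SecondString algorithm; the class name WeightedEditDistance, the 0.2
penalty, and the choice of "light" characters (apostrophe, hyphen) are
all assumptions made up for illustration:

```java
// Hypothetical sketch: edit distance where inserting or deleting an
// apostrophe or hyphen costs less than any other edit.
final class WeightedEditDistance {
    // Assumed penalty for "light" characters; every other edit costs 1.0.
    static final float LIGHT_COST = 0.2f;

    static float cost(char c) {
        return (c == '\'' || c == '-') ? LIGHT_COST : 1.0f;
    }

    // Dynamic-programming edit distance using two rolling rows.
    static float distance(String a, String b) {
        float[] prev = new float[b.length() + 1];
        float[] curr = new float[b.length() + 1];
        for (int j = 1; j <= b.length(); j++) {
            prev[j] = prev[j - 1] + cost(b.charAt(j - 1));
        }
        for (int i = 1; i <= a.length(); i++) {
            char ca = a.charAt(i - 1);
            curr[0] = prev[0] + cost(ca);
            for (int j = 1; j <= b.length(); j++) {
                char cb = b.charAt(j - 1);
                float substitute = prev[j - 1] + (ca == cb ? 0f : 1f);
                float delete = prev[j] + cost(ca);
                float insert = curr[j - 1] + cost(cb);
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            float[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1.0 means identical strings.
    static float score(String a, String b) {
        int max = Math.max(a.length(), b.length());
        if (max == 0) {
            return 1f;
        }
        return 1f - distance(a, b) / max;
    }
}
```

with these assumed weights, score("couldn't", "couldnt") comes out higher
than score("couldn't", "couldnxt"), because dropping the apostrophe is
cheaper than substituting a letter.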


---marc

mark harwood wrote:

>>One thing I was thinking of doing was checking the
>>character frequency
>
>An alternative idea is index-time fuzzification rather
>than query-time. This is documented in one of the case
>studies in LIA (Lucene in Action) - the principle is that you don't
>index/search for whole words but use an NGram Analyzer
>to break them up at index time:
>
>Kylie becomes multiple words:
>[ k]
>[ ky]
>[ kyl]
>[ky]
>[kyl]
>[kyli]
>[yl]
>[yli]
>[ylie]
>[ kylie ]
>
>Obviously you use the same analyzer to process
>queries.
>Lucene will automatically look after relevancy of
>partial matches for you, but your indexes will be bigger
>and your queries will generate many more Boolean
>clauses.
>
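
the n-gram splitting quoted above can be sketched in plain Java, without
any Lucene classes. this is not the analyzer from the LIA case study; the
class name NGramSketch, the space padding at both ends, and the gram
lengths passed in are assumptions for illustration, so the output will
not match the quoted list term for term:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: character n-grams over a lowercased, space-padded word.
final class NGramSketch {
    // Returns all substrings of length minLen..maxLen, in order of start offset.
    static List<String> ngrams(String word, int minLen, int maxLen) {
        String padded = " " + word.toLowerCase(Locale.ROOT) + " ";
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < padded.length(); start++) {
            for (int len = minLen;
                 len <= maxLen && start + len <= padded.length();
                 len++) {
                grams.add(padded.substring(start, start + len));
            }
        }
        return grams;
    }

    // Counts the distinct grams two words share; a crude stand-in for the
    // partial-match scoring that the search engine would do over indexed grams.
    static int sharedGrams(String a, String b, int minLen, int maxLen) {
        Set<String> shared = new HashSet<>(ngrams(a, minLen, maxLen));
        shared.retainAll(new HashSet<>(ngrams(b, minLen, maxLen)));
        return shared.size();
    }
}
```

calling NGramSketch.ngrams("Kylie", 2, 4) yields grams such as " k", " ky",
"kyl" and "ylie". a misspelling like "Kyle" still shares several of them
with "Kylie", which is what lets a partial match score without any
query-time edit-distance computation.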
