lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kai Weingärtner <kai.weingaert...@widas.de>
Subject How to create a fuzzy suggest
Date Tue, 13 Jul 2010 23:11:02 GMT
Hello, 


I am trying to create a suggest search (search results are displayed while the user is entering
the query) for names, but the search should also give results if the given name just sounds
like an indexed name. However a perfect match should be ranked higher than a similar sounding
match. 


I looked at the SpellChecker contrib, but this AFAIK cannot handle incomplete names (edge
n-grams). 


So I came up with this idea and it would be great if anyone could tell me if that is sensible
or if there is a better way: 


I create an analyzer to be run on the full names, which does the following 
- lowercase 
- build edge n-grams 
put these terms in the field (this would handle correctly spelled input) 


- run soundex on the n-grams 
put there soundexed n-grams in the field as well 


The incoming query will then also run through this analyzer with an or-default. So a correct
spelling will match the normal n-grams plus the soundexed n-grams leading to a good score.
A missspelled name would still match the soundexed n-grams, leading to a somewhat lower score.



My current problem is that I don't know how to duplicate the tokens in the analyzer so I can
add them as normal n-grams and soundexed n-grams. I suppose the TeeSinkTokenFilter will get
me there, but I could not figure out how to add all tokens back in one stream. 


To recap, my questions are: Could this approach work to create a "fuzzy suggest"? How do I
use the TeeSinkTokenFilter to separate and recombine the tokenstream. 


I hope that was clear, thanks for your help! 

	

Kai 	




Regelung im Bezug auf Paragraph 37a Absatz 4 HGB: WidasConcepts GmbH,
Geschaeftsfuehrer: Thomas Widmann und Christian Kappert,
Gerichtsstand Pforzheim, Registernummer: HRB 511442, Umsatzsteueridentifikationsnummer: DE205851091

Diese E-Mail enthaelt vertrauliche und/oder rechtlich geschuetzte Informationen.
Wenn Sie nicht der richtige Adressat sind oder diese E-Mail irrtuemlich erhalten haben,
informieren Sie bitte sofort den Absender und vernichten Sie diese Mail.
Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Mail sind nicht gestattet.

This e-mail may contain confidential and/or privileged information.
If you are not the intended recipient (or have received this e-mail in error) please
notify the sender immediately and destroy this e-mail.
Any unauthorized copying, disclosure or distribution of the material in this e-mail is strictly
forbidden.


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message