lucene-java-user mailing list archives

From Alexander Rothenberg <a.rothenb...@fotofinder.net>
Subject Re: How to create a fuzzy suggest
Date Wed, 14 Jul 2010 10:13:28 GMT
Hi,
I had a similar need: to create something that does not act like a "filter"
or "tokenizer" but only inserts self-generated tokens into the token stream.
(My purpose was to generate all kinds of word forms for German umlauts.)

The following code base helped me a lot in creating it:
http://207.44.206.178/message.jspa?messageID=91989#91991

The synonym filter also adds tokens to the token stream, so it is another example to look at.
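
To illustrate the idea of injecting extra tokens, here is a minimal sketch in plain Java. It deliberately does not use the Lucene API: `Token`, `umlautVariants` and `inject` are hypothetical names made up for this example. It assumes already lowercased input and follows the convention synonym filters use, where an injected token gets a position increment of 0 so that it sits at the same position as the original term.

```java
import java.util.ArrayList;
import java.util.List;

class TokenInjectionSketch {

    /** A term plus its position increment (0 = same position as the previous token). */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /** Hypothetical variant generator: spells out lowercase German umlauts and sharp s. */
    static List<String> umlautVariants(String term) {
        String expanded = term
                .replace("\u00e4", "ae")   // a-umlaut
                .replace("\u00f6", "oe")   // o-umlaut
                .replace("\u00fc", "ue")   // u-umlaut
                .replace("\u00df", "ss");  // sharp s
        List<String> variants = new ArrayList<>();
        if (!expanded.equals(term)) {
            variants.add(expanded);
        }
        return variants;
    }

    /** Keeps every input term and injects its variants at the same position. */
    static List<Token> inject(List<String> terms) {
        List<Token> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Token(term, 1));          // original term advances the position
            for (String variant : umlautVariants(term)) {
                out.add(new Token(variant, 0));   // injected term shares that position
            }
        }
        return out;
    }
}
```

For the terms `kai` and `weingärtner`, `inject` keeps both originals and adds `weingaertner` at the same position as `weingärtner`. In a real Lucene filter the same logic would live in a custom TokenFilter that buffers the pending variants between calls.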

Regards, Alex



On Wednesday 14 July 2010 01:11:02 Kai Weingärtner wrote:
> Hello,
>
>
> I am trying to create a suggest search (search results are displayed while
> the user is entering the query) for names, but the search should also give
> results if the given name just sounds like an indexed name. However a
> perfect match should be ranked higher than a similar sounding match.
>
>
> I looked at the SpellChecker contrib, but this AFAIK cannot handle
> incomplete names (edge n-grams).
>
>
> So I came up with this idea and it would be great if anyone could tell me
> if that is sensible or if there is a better way:
>
>
> I create an analyzer to be run on the full names, which does the following:
> - lowercase
> - build edge n-grams
> put these terms in the field (this would handle correctly spelled input)
>
>
> - run soundex on the n-grams
> put these soundexed n-grams in the field as well
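
The analysis steps listed above can be sketched in plain Java, without the Lucene API. All names here (`edgeNGrams`, `soundex`, `indexTerms`) are hypothetical, and the `soundex` method is a hand-written version of classic American Soundex, not the implementation any particular library ships. Soundex codes start with an uppercase letter, so they cannot collide with the lowercased n-grams in the same field.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class EdgeNGramSoundexSketch {

    /** Lowercases the name and returns its prefixes of length minGram..maxGram. */
    static List<String> edgeNGrams(String name, int minGram, int maxGram) {
        String lower = name.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= Math.min(maxGram, lower.length()); len++) {
            grams.add(lower.substring(0, len));
        }
        return grams;
    }

    /** Classic American Soundex: first letter plus up to three digits, zero-padded. */
    static String soundex(String s) {
        String codes = "01230120022455012623010202"; // digit for a..z, '0' = not coded
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char raw : s.toCharArray()) {
            char c = Character.toLowerCase(raw);
            if (c < 'a' || c > 'z') {
                continue; // ignore anything outside a..z
            }
            char code = codes.charAt(c - 'a');
            if (out.length() == 0) {
                out.append(Character.toUpperCase(c));
                last = code;
            } else {
                if (code != '0' && code != last) {
                    out.append(code);
                }
                if (c != 'h' && c != 'w') {
                    last = code; // h and w do not separate equal codes, vowels do
                }
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() > 0 && out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    /** All terms for one name: the edge n-grams first, then their Soundex codes. */
    static Set<String> indexTerms(String name, int minGram, int maxGram) {
        List<String> grams = edgeNGrams(name, minGram, maxGram);
        Set<String> terms = new LinkedHashSet<>(grams);
        for (String gram : grams) {
            terms.add(soundex(gram));
        }
        return terms;
    }
}
```

One thing the sketch makes visible: short prefixes carry little phonetic information. With `indexTerms("Meyer", 1, 5)` the prefixes `m`, `me`, `mey` and `meye` all map to the code `M000`, and only the full name yields `M600`. It may therefore be worth running Soundex only on longer n-grams.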
>
>
> The incoming query will then also run through this analyzer, with OR as
> the default operator. So a correct spelling will match the normal n-grams
> plus the soundexed n-grams, leading to a good score. A misspelled name would
> still match the soundexed n-grams, leading to a somewhat lower score.
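
The ranking argument above can be checked with a small plain-Java sketch. It counts shared terms between an indexed name and a query, which is only a crude stand-in for Lucene's actual scoring. The class and the term sets are hypothetical: the sets were written out by hand for the indexed name "meyer" and the queries "meyer" and "meier", using edge n-grams of length 1 to 5 plus classic Soundex codes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

class OverlapScoreSketch {

    /** Hand-written term sets: edge n-grams plus their Soundex codes. */
    static final Set<String> INDEXED_MEYER =
            terms("m", "me", "mey", "meye", "meyer", "M000", "M600");
    static final Set<String> QUERY_MEYER =
            terms("m", "me", "mey", "meye", "meyer", "M000", "M600");
    static final Set<String> QUERY_MEIER =
            terms("m", "me", "mei", "meie", "meier", "M000", "M600");

    static Set<String> terms(String... values) {
        return new LinkedHashSet<>(Arrays.asList(values));
    }

    /** Number of query terms that also occur in the indexed field (OR semantics). */
    static int overlap(Set<String> indexed, Set<String> query) {
        Set<String> shared = new HashSet<>(indexed);
        shared.retainAll(query);
        return shared.size();
    }

    public static void main(String[] args) {
        System.out.println("meyer vs meyer: " + overlap(INDEXED_MEYER, QUERY_MEYER));
        System.out.println("meyer vs meier: " + overlap(INDEXED_MEYER, QUERY_MEIER));
    }
}
```

The exact spelling shares all 7 terms with the indexed name, while the misspelling "meier" shares 4 (the prefixes `m` and `me` plus both Soundex codes). So the exact match would be expected to rank higher, as long as the real scoring rewards matching more clauses.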
>
>
> My current problem is that I don't know how to duplicate the tokens in the
> analyzer so I can add them as normal n-grams and soundexed n-grams. I
> suppose the TeeSinkTokenFilter will get me there, but I could not figure
> out how to add all tokens back in one stream.
>
>
> To recap, my questions are: Could this approach work to create a "fuzzy
> suggest"? How do I use the TeeSinkTokenFilter to separate and recombine the
> token stream?
>
>
> I hope that was clear, thanks for your help!
>
>
>
> Kai



-- 
Alexander Rothenberg
Fotofinder GmbH		USt-IdNr. DE812854514
Software Entwicklung	Web: http://www.fotofinder.net/
Potsdamer Str. 96	Tel: +49 30 25792890
10785 Berlin		Fax: +49 30 257928999

Geschäftsführer:	Ali Paczensky
Amtsgericht:		Berlin Charlottenburg (HRB 73099)
Sitz:			Berlin


