lucene-dev mailing list archives

From "Eks Dev (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3846) Fuzzy suggester
Date Sun, 04 Mar 2012 18:17:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221961#comment-13221961 ]

Eks Dev commented on LUCENE-3846:
---------------------------------

Awesome! FST/A has come a long way.

Just a few random thoughts, triggered by "... "corrected" suggestion needs to have a much higher
freq than the "exact match"..."

The influence of frequency is normally a bit more complicated than "only more popular"; it depends
on the search task the user is facing. "Only more popular" helps if we assume the user types it wrong
and our suggestion dictionary is always right. But when the user types it correctly and the
collection contains errors, you would cut out all the documents that "fuzzy" could have found.

What I found works pretty well is treating this as a nearest-neighbor problem. Namely, the
task is to find the closest matches to the query. Some are more popular and some less. Take for
example a case where the user types "black dog" and our collection contains the document "blaKC dog":
since the frequency of "blakc" is much lower than that of "black", "only more popular" would miss
this document.
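
To make the contrast concrete, here is a rough sketch of both strategies side by side. Everything in it is an assumption for illustration: the class and method names are made up (not Lucene API), the edit counts are taken as precomputed inputs, and the frequencies are invented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration only; none of these names exist in Lucene. */
final class NearestNeighborRanking {

    /** A fuzzy candidate with its (precomputed) edit count and collection frequency. */
    static final class Candidate {
        final String term;
        final int edits;
        final long freq;

        Candidate(String term, int edits, long freq) {
            this.term = term;
            this.edits = edits;
            this.freq = freq;
        }
    }

    /** "Only more popular": drops every candidate that is not more frequent than the query term. */
    static List<String> onlyMorePopular(long queryFreq, List<Candidate> candidates) {
        List<String> out = new ArrayList<>();
        for (Candidate c : candidates) {
            if (c.freq > queryFreq) {
                out.add(c.term);
            }
        }
        return out;
    }

    /** Nearest neighbor: keep everything within maxEdits, closest first, frequency only breaks ties. */
    static List<String> nearest(List<Candidate> candidates, int maxEdits) {
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            if (c.edits <= maxEdits) {
                kept.add(c);
            }
        }
        kept.sort(Comparator.comparingInt((Candidate c) -> c.edits)
                .thenComparing(Comparator.comparingLong((Candidate c) -> c.freq).reversed()));
        List<String> out = new ArrayList<>();
        for (Candidate c : kept) {
            out.add(c.term);
        }
        return out;
    }
}
```

With a query term "black" at frequency 5000 and a rare candidate "blakc" at frequency 2, the first method drops "blakc" while the second keeps it, just ranked after the more frequent neighbors at the same distance.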

What works pretty well out of the box is comparing the frequencies of the query word and the "candidate"
against some reasonable cut-off and classifying each as an "HF" or "LF" (high/low frequency) term. It
is based on the fact that typos are normally very rare (if not, they should be treated as
synonyms!). So if the user types an LF token, the fuzzy candidate is probably HF, and the other
way around.
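
A minimal sketch of that HF/LF check, assuming term frequencies are available in a plain map. The class name, the method names and the cut-off value are all hypothetical, and unknown terms are simply treated as frequency 0.

```java
import java.util.Map;

/** Hypothetical helper; the names and the cut-off are illustrative, not Lucene API. */
final class FrequencyClassFilter {

    enum FreqClass { HF, LF }

    private final Map<String, Integer> termFreq;
    private final int cutOff;

    FrequencyClassFilter(Map<String, Integer> termFreq, int cutOff) {
        this.termFreq = termFreq;
        this.cutOff = cutOff;
    }

    /** Terms at or above the cut-off are HF; everything else (including unknown terms) is LF. */
    FreqClass classify(String term) {
        return termFreq.getOrDefault(term, 0) >= cutOff ? FreqClass.HF : FreqClass.LF;
    }

    /**
     * Accept a fuzzy candidate only when it falls in the opposite frequency class
     * from the query term: LF query with HF candidate, or HF query with LF candidate.
     */
    boolean acceptFuzzy(String queryTerm, String candidate) {
        return classify(queryTerm) != classify(candidate);
    }
}
```

Note that the check is symmetric: it lets a correctly typed "black" reach a rare "blakc" in the collection, and a mistyped "blakc" reach "black", while two HF terms such as "black" and "block" are not offered as corrections of each other.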

But as said, it depends what the task is.    


The next level for "fuzzy *" in Lucene is specifying separate costs for inserts/deletes,
swaps (substitutions) and transpositions at the character (byte) level, optionally taking the
position of the edit into account. This brings precision++ if used properly, for example:
- inserting/deleting a silent "h" should cost less than other letters (thomas vs tomas)
- phonetics: swapping "c" <-> "k" is less evil than the default
- inserting "s" at the end... bug vs bugs
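
The three examples above could be sketched as a weighted edit distance. This is a plain dynamic-programming version (the restricted "optimal string alignment" variant with adjacent transpositions), not the automaton-based approach used for fuzzy matching in Lucene, and all names and cost values are assumptions picked only to show the idea.

```java
/** Hypothetical sketch; cost values are illustrative assumptions, not tuned numbers. */
final class WeightedEditDistance {

    static final double TRANSPOSE_COST = 1.0;

    /** Cost of inserting or deleting c; atEnd is true when the edit touches the last position. */
    static double insDelCost(char c, boolean atEnd) {
        if (c == 'h') return 0.5;            // silent h: thomas vs tomas
        if (c == 's' && atEnd) return 0.5;   // trailing s: bug vs bugs
        return 1.0;
    }

    /** Cost of replacing a with b. */
    static double substCost(char a, char b) {
        if (a == b) return 0.0;
        if ((a == 'c' && b == 'k') || (a == 'k' && b == 'c')) return 0.5; // phonetic swap
        return 1.0;
    }

    /** Weighted distance between s and t using the cost functions above. */
    static double distance(String s, String t) {
        int n = s.length();
        int m = t.length();
        double[][] d = new double[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
            d[i][0] = d[i - 1][0] + insDelCost(s.charAt(i - 1), i == n);
        }
        for (int j = 1; j <= m; j++) {
            d[0][j] = d[0][j - 1] + insDelCost(t.charAt(j - 1), j == m);
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                char a = s.charAt(i - 1);
                char b = t.charAt(j - 1);
                double best = d[i - 1][j - 1] + substCost(a, b);            // match or substitute
                best = Math.min(best, d[i - 1][j] + insDelCost(a, i == n)); // delete a from s
                best = Math.min(best, d[i][j - 1] + insDelCost(b, j == m)); // insert b from t
                if (i > 1 && j > 1 && a == t.charAt(j - 2) && s.charAt(i - 2) == b) {
                    best = Math.min(best, d[i - 2][j - 2] + TRANSPOSE_COST); // adjacent transposition
                }
                d[i][j] = best;
            }
        }
        return d[n][m];
    }
}
```

With these made-up costs, "thomas"/"tomas", "cat"/"kat" and "bug"/"bugs" each come out at 0.5, while an ordinary substitution such as "bug"/"bag" stays at 1.0, so the "cheap" edits rank ahead of the generic ones.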

Apart from that, I see absolutely nothing more anyone on earth could do better :)


Sorry again for just shooting "wish lists" at you guys; my schedule really
does not permit any serious work in the form of patches.
                
> Fuzzy suggester
> ---------------
>
>                 Key: LUCENE-3846
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3846
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3846.patch
>
>
> Would be nice to have a suggester that can handle some fuzziness (like spell correction)
> so that it's able to suggest completions that are "near" what you typed.
> As a first go at this, I implemented 1T (ie up to 1 edit, including a transposition),
> except the first letter must be correct.
> But there is a penalty, ie, the "corrected" suggestion needs to have a much higher freq
> than the "exact match" suggestion before it can compete.
> Still tons of nocommits, and somehow we should merge this / make it work with analyzing
> suggester too (LUCENE-3842).

