lucene-dev mailing list archives

From "Eks Dev (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3846) Fuzzy suggester
Date Sun, 04 Mar 2012 21:26:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222014#comment-13222014 ]

Eks Dev commented on LUCENE-3846:
---------------------------------

{quote}
feel free to show me evidence they do
{quote}

Even here they help a lot; do not underestimate the error model (as in the noisy channel model; see http://norvig.com/spell-correct.html
for a nice overview).

Examples, off the top of my head:
If you search for Carin in the set {Karin, Marin, Darin} (all valid names, all at edit distance
one), you would prefer to see Karin as the highest-ranked (or even the only) fuzzy suggestion,
because C and K are close consonants.
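
A minimal sketch of that ranking, assuming a hand-written table of "close" consonant pairs and a made-up discount of 0.5 for substituting within a pair. The names `CLOSE_PAIRS`, `sub_cost`, `weighted_distance` and `rank` are illustrative only and are not taken from the patch or from Lucene:

```python
# Hypothetical confusion pairs: consonants that are easily mistaken for each other.
CLOSE_PAIRS = {frozenset("ck"), frozenset("sz"), frozenset("dt")}


def sub_cost(a, b):
    """Cost of typing `a` where `b` was intended (values are assumptions)."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CLOSE_PAIRS:
        return 0.5  # discounted: close consonants
    return 1.0


def weighted_distance(typed, candidate):
    """Levenshtein distance with a per-pair substitution cost."""
    typed, candidate = typed.lower(), candidate.lower()
    prev = [float(j) for j in range(len(candidate) + 1)]
    for i, a in enumerate(typed, 1):
        cur = [float(i)]
        for j, b in enumerate(candidate, 1):
            cur.append(min(
                prev[j] + 1.0,               # delete a
                cur[j - 1] + 1.0,            # insert b
                prev[j - 1] + sub_cost(a, b),  # substitute (or match)
            ))
        prev = cur
    return prev[-1]


def rank(typed, candidates):
    """Order candidates by weighted distance, ties broken alphabetically."""
    return sorted(candidates, key=lambda c: (weighted_distance(typed, c), c))


print(rank("Carin", ["Marin", "Darin", "Karin"]))
# prints ['Karin', 'Darin', 'Marin']
```

With plain edit distance all three names tie at 1; the discounted C/K substitution is what moves Karin to the top.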

Or apply a discount to swap(vowel, vowel) versus swap(vowel/consonant, consonant). Mistaking one vowel
for another is more probable than mistaking one consonant for another, or a consonant for a vowel (as long
as humans are doing the typing).
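
As a rough illustration, such a discount could be expressed as a class-based substitution cost. The 0.6 figure and the function names below are assumptions chosen for the example, not measured values; for brevity the sketch only compares words of equal length, position by position:

```python
VOWELS = set("aeiou")


def substitution_cost(typed, intended):
    """Class-based cost: vowel-for-vowel mistakes are discounted (assumed 0.6)."""
    if typed == intended:
        return 0.0
    if typed in VOWELS and intended in VOWELS:
        return 0.6
    return 1.0  # consonant/consonant or consonant/vowel


def positional_cost(typed, intended):
    """Sum of substitution costs for two equal-length words (no inserts or deletes)."""
    if len(typed) != len(intended):
        raise ValueError("this sketch only handles equal-length words")
    return sum(substitution_cost(a, b)
               for a, b in zip(typed.lower(), intended.lower()))


print(positional_cost("Karen", "Karin"))  # vowel for vowel: 0.6
print(positional_cost("Karim", "Karin"))  # consonant for consonant: 1.0
```

In a real suggester the costs would come from observed error data rather than from a fixed constant.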

Books scanned using OCR have no problems with phonetics, but other...

Context is important: in-word context as part of the "error model" (character-level context, such as
the previous character), but even more important is the context from the "language model", which
normally dominates.

I could look for some interesting papers in my archives if you are not convinced yet :)
This one is worth reading (http://acl.ldc.upenn.edu/P/P00/P00-1037.pdf); it tackles, among other
things, exactly this topic.
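
To make the error model / language model interplay concrete, here is a toy noisy-channel ranking: score(candidate) = log P(typed | candidate) + log P(candidate). All probabilities and counts are invented for the example, and `rank_noisy_channel` is a hypothetical helper name:

```python
import math


def rank_noisy_channel(channel_probs, counts):
    """Rank candidates by log P(typed | candidate) + log P(candidate).

    channel_probs: error model, P(typed | candidate) for one fixed typed string.
    counts: raw frequencies used as a unigram language model.
    """
    total = sum(counts.values())

    def score(candidate):
        return (math.log(channel_probs[candidate])
                + math.log(counts[candidate] / total))

    return sorted(channel_probs, key=score, reverse=True)


# Assumed error model for the typed string "Carin": C/K confusion is 10x likelier.
channel = {"Karin": 0.05, "Marin": 0.005, "Darin": 0.005}

# With equal frequencies, the error model decides.
print(rank_noisy_channel(channel, {"Karin": 100, "Marin": 100, "Darin": 100})[0])

# With a strongly skewed language model, frequency overrides the error model.
print(rank_noisy_channel(channel, {"Karin": 50, "Marin": 4000, "Darin": 30})[0])
```

The first call puts Karin on top and the second puts Marin on top, which is the sense in which the language model normally dominates.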

{quote}
it's easy to use a custom cost matrix. The cost can also be context-dependent too (based on
past matched characters, though not [easily] future ones).
{quote}
 
Great to hear that!
Prefix-based context is the only sub-word-level context I have ever used. I doubt lookahead
would add much.
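
For what it is worth, a sketch of what prefix-based context could look like: the substitution cost receives the previous character of the intended word as well as the two characters being compared. The rule used here (C/K confusion is cheap right after an "s") and the 0.3 cost are hypothetical, and this plain dynamic-programming version says nothing about how the patch itself would do it:

```python
def contextual_distance(typed, intended, sub_cost):
    """Edit distance where sub_cost(prev_char, typed_char, intended_char)
    may depend on the previous character of the intended word."""
    prev_row = [float(j) for j in range(len(intended) + 1)]
    for i, a in enumerate(typed, 1):
        row = [float(i)]
        for j, b in enumerate(intended, 1):
            context = intended[j - 2] if j > 1 else ""  # "" at word start
            row.append(min(
                prev_row[j] + 1.0,                         # delete a
                row[j - 1] + 1.0,                          # insert b
                prev_row[j - 1] + sub_cost(context, a, b),  # substitute (or match)
            ))
        prev_row = row
    return prev_row[-1]


def example_cost(prev, a, b):
    """Hypothetical rule: right after "s", "c" and "k" are easily confused."""
    if a == b:
        return 0.0
    if prev == "s" and {a, b} == {"c", "k"}:
        return 0.3
    return 1.0


print(contextual_distance("scip", "skip", example_cost))  # 0.3, context applies
print(contextual_distance("cip", "kip", example_cost))    # 1.0, no "s" before it
```

Only already-seen characters are consulted, so the cost can be computed left to right without lookahead.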

                
> Fuzzy suggester
> ---------------
>
>                 Key: LUCENE-3846
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3846
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3846.patch
>
>
> Would be nice to have a suggester that can handle some fuzziness (like spell correction) so that it's able to suggest completions that are "near" what you typed.
> As a first go at this, I implemented 1T (ie up to 1 edit, including a transposition), except the first letter must be correct.
> But there is a penalty, ie, the "corrected" suggestion needs to have a much higher freq than the "exact match" suggestion before it can compete.
> Still tons of nocommits, and somehow we should merge this / make it work with analyzing suggester too (LUCENE-3842).


