lucene-dev mailing list archives

From "Mark Miller (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
Date Sun, 17 Aug 2008 14:37:44 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1124:
--------------------------------

    Attachment: LUCENE-1124.patch

This optimization is correct. It also highlights some interesting things about fuzzy query,
e.g. if you set a minsim of 0.9, your query term *has* to be over 10 chars to have any hope
of getting a match. For the default of 0.5 it's 2 chars, so in the common case the optimization
doesn't do much good, and you do have to pay for the check every time no matter what. For
larger minsim values though, this will turn a lot of fuzzy queries into no-ops.
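
To make those numbers concrete, here is a minimal sketch of the length check quoted in the
issue description below. The class and method names are hypothetical and are not part of
Lucene or of the attached patch; only the inequality comes from the issue. Exact boundary
cases (such as length 10 with a minsim of 0.9) can be decided by float rounding, so the
sample values stay away from them.

```java
final class FuzzyShortCircuit {

    // Hypothetical helper (not a Lucene API): the length test from the issue
    // description, token.length() > 1f / (1f - minSimilarity).
    static boolean worthEnumerating(int termLength, float minSimilarity) {
        return termLength > 1f / (1f - minSimilarity);
    }

    public static void main(String[] args) {
        float[] minSims = {0.5f, 0.7f, 0.9f};
        int[] lengths = {2, 3, 9, 11};
        for (float minSim : minSims) {
            for (int len : lengths) {
                // Print whether a term of this length passes the check.
                System.out.println("minSim=" + minSim + " length=" + len
                        + " enumerate=" + worthEnumerating(len, minSim));
            }
        }
    }
}
```

With the default of 0.5 only terms of 1 or 2 chars are skipped, while with 0.9 anything up
to 9 chars is skipped.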

- Mark

> short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1124
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1124
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Query/Scoring
>            Reporter: Hoss Man
>         Attachments: LUCENE-1124.patch
>
>
> I found this (unreplied to) email floating around in my Lucene folder from during the
> holidays...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <200712311601.12255.lucene@nitwit.de>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>            if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite
> should be able to completely skip all TermEnumeration in the event that the input token is
> shorter than some simple math on the minSimilarity.  (I'm not smart enough to be certain that
> the math above is right however ... it's been a while since I looked at Levenshtein distances
> ... tests needed)
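
As a rough way to check the math in the quoted email, the sketch below computes a plain
Levenshtein distance and turns it into a similarity score. The formula used here,
1 - distance / (length of the shorter string), is an assumption about how the fuzzy
similarity is normalized and is not taken from the patch; the class and method names are
hypothetical. Under that assumption, one changed character in "foo" gives a similarity of
about 0.67, which clears a minSimilarity of 0.5 but not 0.7.

```java
final class FuzzySimilaritySketch {

    // Standard two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // ASSUMPTION: similarity = 1 - distance / min(length); not confirmed by the source.
    static float similarity(String query, String candidate) {
        int shorter = Math.min(query.length(), candidate.length());
        if (shorter == 0) {
            return query.length() == candidate.length() ? 1f : 0f;
        }
        return 1f - (float) levenshtein(query, candidate) / shorter;
    }

    public static void main(String[] args) {
        System.out.println("foo vs fop: " + similarity("foo", "fop"));
        System.out.println("foo vs foo: " + similarity("foo", "foo"));
    }
}
```

Note that an exact match has distance 0 and therefore similarity 1 whatever the term length,
so under this formula the length check only rules out the non-exact matches.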

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

