lucene-dev mailing list archives

From "Francisco Alvarez (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-2667) Fix FuzzyQuery's defaults, so its fast.
Date Thu, 13 Sep 2012 09:04:07 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454753#comment-13454753 ]

Francisco Alvarez commented on LUCENE-2667:
-------------------------------------------

The fuzzy search functionality has been dramatically limited by this new implementation.

Previously it was possible to search with edit distances greater than 2, which is necessary
in many situations.

We have tried to increase the MAXIMUM_SUPPORTED_DISTANCE value but got the following error:

java.lang.NullPointerException
     at org.apache.lucene.util.automaton.UTF32ToUTF8.convert(UTF32ToUTF8.java:259)
     at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:163)
     at org.apache.lucene.search.FuzzyTermsEnum.initAutomata(FuzzyTermsEnum.java:182)
     at org.apache.lucene.search.FuzzyTermsEnum.getAutomatonEnum(FuzzyTermsEnum.java:153)
     at org.apache.lucene.search.FuzzyTermsEnum.maxEditDistanceChanged(FuzzyTermsEnum.java:217)

We need a solution for fuzzy searches with edit distances greater than 2, to keep behaviour
consistent with Lucene 3.x.
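
For illustration only, and not Lucene code: the sketch below shows what "edit distance greater
than 2" means, using a plain dynamic-programming Levenshtein distance and a brute-force filter
over candidate terms. The class and method names (EditDistance, levenshtein, within) are
hypothetical, and the comparison works on Java chars rather than Unicode code points. Scanning
every candidate like this is far slower than an automaton-based query; it only makes the
requested behaviour concrete.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Lucene. */
final class EditDistance {

    /** Two-row dynamic-programming Levenshtein distance (no transpositions). */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Brute-force filter: keeps candidates within maxEdits of the query term. */
    static List<String> within(String term, Iterable<String> candidates, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (String candidate : candidates) {
            if (levenshtein(term, candidate) <= maxEdits) {
                hits.add(candidate);
            }
        }
        return hits;
    }
}
```

With maxEdits set to 3, a term such as "fxxbxr" (three substitutions away from "foobar") is
kept, whereas a limit of 2 would drop it.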


                
> Fix FuzzyQuery's defaults, so its fast.
> ---------------------------------------
>
>                 Key: LUCENE-2667
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2667
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 4.0-ALPHA
>
>         Attachments: LUCENE-2667_contrib.patch, LUCENE-2667.patch, LUCENE-2667.patch
>
>
> We worked a lot on FuzzyQuery, but you need to be a rocket scientist to ensure good results.
> The main problem is that the default distance is 0.5f, which doesn't take into account
> the length of the string.
> To add insult to injury, the default number of expansions is 1024 (traditionally from
> BooleanQuery maxClauseCount).
> I propose:
> * The syntax of FuzzyQuery is enhanced, so that you can specify raw edits too, such as
>   foobar~2 (all terms within 2 Levenshtein edits of foobar). Previously, if you specified any
>   amount >=1, you got IllegalArgumentException, so this won't break anyone. You can still
>   use foobar~0.5, and it works just as before.
> * The default for minimumSimilarity then becomes LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE,
>   which is 2. This way, if you just do foobar~, it's always fast.
> * The size of the priority queue is reduced by default from 1024 to a much more reasonable
>   value: 50. This is what FuzzyLikeThis uses.
> I think it's best to just change the defaults for this query, since it was so awful before.
> We can add notes in migrate.txt that if you care about using the old values, then you should
> provide them explicitly, and you will get the same results!
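
A minimal sketch of the two ways of reading the number after "~" described in the proposal,
not the actual Lucene implementation. It assumes that a value of 1 or more is taken as a raw
edit count, that a fractional similarity is scaled by the term length as (1 - similarity) *
length, and that the result is capped at a supported maximum (2 in the proposal). The names
FuzzySyntaxSketch and toEdits are hypothetical.

```java
/** Hypothetical illustration of the proposed syntax; not Lucene's own code. */
final class FuzzySyntaxSketch {

    /**
     * Maps the number after '~' to a maximum edit count.
     * Assumption: value >= 1 is a raw edit count; value < 1 is a similarity
     * that allows (1 - value) * termLength edits. The result never exceeds
     * maxSupported.
     */
    static int toEdits(float value, int termLength, int maxSupported) {
        if (value < 0f) {
            throw new IllegalArgumentException("value must not be negative: " + value);
        }
        int edits;
        if (value >= 1f) {
            edits = (int) value;                      // e.g. foobar~2
        } else {
            edits = (int) ((1f - value) * termLength); // e.g. foobar~0.5
        }
        return Math.min(edits, maxSupported);
    }
}
```

Under these assumptions, foobar~0.5 on a six-character term asks for 3 edits, so a cap of 2
reduces it to 2; without a cap, the same similarity on a twelve-character term would allow 6,
which is why a fixed similarity does not behave uniformly across term lengths.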

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

