Date: Thu, 13 Sep 2012 20:17:07 +1100 (NCT)
From: "Uwe Schindler (JIRA)"
To: dev@lucene.apache.org
Message-ID: <1574581308.74146.1347527827817.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (LUCENE-2667) Fix FuzzyQuery's defaults, so its fast.

    [ https://issues.apache.org/jira/browse/LUCENE-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454759#comment-13454759 ]

Uwe Schindler commented on LUCENE-2667:
---------------------------------------

Hi Francisco:

The core FuzzyQuery does not support edit distances > 2, because the automatons used for this would be too big and slow. If you really want distances > 2, use http://lucene.apache.org/core/4_0_0-BETA/sandbox/org/apache/lucene/sandbox/queries/SlowFuzzyQuery.html from the sandbox module (lucene-sandbox.jar). This one is the same algorithm as the old 3.x FuzzyQuery (and is as slow).

> Fix FuzzyQuery's defaults, so its fast.
> ---------------------------------------
>
>                 Key: LUCENE-2667
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2667
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 4.0-ALPHA
>
>         Attachments: LUCENE-2667_contrib.patch, LUCENE-2667.patch, LUCENE-2667.patch
>
>
> We worked a lot on FuzzyQuery, but you need to be a rocket scientist to ensure good results.
> The main problem is that the default distance is 0.5f, which doesn't take into account the length of the string.
> To add insult to injury, the default number of expansions is 1024 (traditionally from BooleanQuery maxClauseCount).
> I propose:
> * The syntax of FuzzyQuery is enhanced so that you can specify raw edits too, such as foobar~2 (all terms within 2 Levenshtein edits of foobar). Previously, if you specified any amount >=1, you got IllegalArgumentException, so this won't break anyone. You can still use foobar~0.5, and it works just as before.
> * The default for minimumSimilarity then becomes LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE, which is 2. This way, if you just do foobar~, it's always fast.
> * The size of the priority queue is reduced by default from 1024 to a much more reasonable value: 50. This is what FuzzyLikeThis uses.
> I think it's best to just change the defaults for this query, since it was so awful before. We can add notes in migrate.txt that if you care about using the old values, then you should provide them explicitly, and you will get the same results!
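
The difference between a fractional similarity such as foobar~0.5 and a raw edit count such as foobar~2 can be illustrated with a small, self-contained sketch. It does not use Lucene and is not Lucene's code: the class FuzzyEditsSketch, its methods, and the mapping edits = (1 - similarity) * termLength are assumptions made for illustration. Capping at 2 mirrors the LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE value quoted above; how the real classes treat out-of-range values is not covered here.

```java
// Hypothetical illustration only; none of these names come from Lucene.
final class FuzzyEditsSketch {
    // Mirrors the value 2 quoted in the issue for MAXIMUM_SUPPORTED_DISTANCE.
    static final int MAX_EDITS = 2;

    // Assumed legacy-style mapping: allowed edits grow with the term length.
    static int uncappedEdits(float similarity, int termLength) {
        return (int) ((1.0 - similarity) * termLength);
    }

    // Values >= 1 are read as raw edit counts, values below 1 as a similarity.
    // Both are capped at MAX_EDITS (a choice made by this sketch).
    static int toEdits(float value, int termLength) {
        if (value < 0f) {
            throw new IllegalArgumentException("value must not be negative");
        }
        if (value >= 1f) {
            return Math.min((int) value, MAX_EDITS);
        }
        return Math.min(uncappedEdits(value, termLength), MAX_EDITS);
    }

    // Plain dynamic-programming Levenshtein distance over two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // The same 0.5 similarity allows very different edit counts by length.
        for (int len : new int[] {1, 3, 6, 10}) {
            System.out.println("length " + len + ": uncapped "
                + uncappedEdits(0.5f, len) + ", capped " + toEdits(0.5f, len));
        }
        System.out.println("foobar vs fubar: " + levenshtein("foobar", "fubar"));
    }
}
```

Under these assumptions, a ten-character term with similarity 0.5 would tolerate five edits if left uncapped, while a one-character term tolerates none. A raw count such as ~2 means the same thing for every term length.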