Message-ID: <6450266.442791285681834909.JavaMail.jira@thor>
Date: Tue, 28 Sep 2010 09:50:34 -0400 (EDT)
From: "Robert Muir (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-2667) Fix FuzzyQuery's defaults, so its fast.
In-Reply-To: <32719752.394581285371992962.JavaMail.jira@thor>

     [ https://issues.apache.org/jira/browse/LUCENE-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2667:
--------------------------------

    Attachment: LUCENE-2667.patch

Here's an updated patch; I think this is ready to commit.
* Use integer calculations internally to avoid the tricky float stuff.
* I added tests for the case where the edit distance is greater than the length of the word. Previously, it was not possible to issue this type of query, as noted in the enum:
{noformat}
  // this will return less than 0.0 when the edit distance is
  // greater than the number of characters in the shorter word.
  // but this was the formula that was previously used in FuzzyTermEnum,
  // so it has not been changed (even though minimumSimilarity must be
  // greater than 0.0)
{noformat}
* I removed a TODO, so the linear enum now gets an optimization from the priority queue: it uses the updated maxEdits to quickly reject terms that are too long or too short.

> Fix FuzzyQuery's defaults, so its fast.
> ---------------------------------------
>
>                 Key: LUCENE-2667
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2667
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-2667.patch, LUCENE-2667.patch
>
>
> We worked a lot on FuzzyQuery, but you need to be a rocket scientist to ensure good results.
> The main problem is that the default distance is 0.5f, which doesn't take into account the length of the string.
> To add insult to injury, the default number of expansions is 1024 (traditionally from BooleanQuery maxClauseCount).
> I propose:
> * The syntax of FuzzyQuery is enhanced, so that you can specify raw edits too, such as foobar~2 (all terms within 2 Levenshtein edits of foobar). Previously, if you specified any amount >=1, you got IllegalArgumentException, so this won't break anyone. You can still use foobar~0.5, and it works just as before.
> * The default for minimumSimilarity then becomes LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE, which is 2. This way, if you just do foobar~, it's always fast.
> * The size of the priority queue is reduced by default from 1024 to a much more reasonable value: 50. This is what FuzzyLikeThis uses.
> I think it's best to just change the defaults for this query, since it was so awful before. We can add notes in migrate.txt that if you care about using the old values, then you should provide them explicitly, and you will get the same results!
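
For readers following the thread, the similarity formula and the two forms of the ~ argument discussed above can be sketched roughly as follows. This is an illustration only, not the code from the patch. The class FuzzySimilaritySketch and its methods are hypothetical names, the formula is assumed to be 1 - edits / min(length) with any shared prefix ignored, and the fractional-to-edits conversion is an assumed reading of how a value such as 0.5 maps to an edit budget. The only value taken from the issue is the maximum supported distance of 2.

```java
// Illustrative sketch only; NOT the Lucene source. All names here are hypothetical.
final class FuzzySimilaritySketch {

    // From the issue text: LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE is 2.
    static final int MAXIMUM_SUPPORTED_DISTANCE = 2;

    /**
     * Assumed classic formula: 1 - edits / (length of the shorter word).
     * The result drops below 0.0 when the edit distance exceeds that length.
     */
    static float similarity(int editDistance, int queryLength, int termLength) {
        int shorter = Math.min(queryLength, termLength);
        if (shorter <= 0) {
            throw new IllegalArgumentException("both words must be non-empty");
        }
        return 1.0f - ((float) editDistance / shorter);
    }

    /**
     * Assumed interpretation of the argument after '~':
     * values >= 1 are raw edit counts (foobar~2),
     * values in [0, 1) are a minimum similarity (foobar~0.5).
     */
    static int editsFor(float fuzzyArg, int termLength) {
        if (fuzzyArg < 0f) {
            throw new IllegalArgumentException("argument must be >= 0");
        }
        if (fuzzyArg >= 1f) {
            return (int) fuzzyArg;
        }
        // Truncates toward zero; fractions that are not exact in binary
        // (for example 0.8f) can land one edit lower than expected.
        return (int) ((1.0 - fuzzyArg) * termLength);
    }

    /** True when the edit budget is small enough for the fast automaton path. */
    static boolean withinAutomatonRange(int edits) {
        return edits <= MAXIMUM_SUPPORTED_DISTANCE;
    }

    public static void main(String[] args) {
        String query = "foobar";
        System.out.println("similarity at 2 edits: " + similarity(2, query.length(), 6));
        System.out.println("similarity at 3 edits vs a 2-char term: " + similarity(3, query.length(), 2));
        System.out.println("foobar~2   -> edits " + editsFor(2f, query.length()));
        System.out.println("foobar~0.5 -> edits " + editsFor(0.5f, query.length()));
    }
}
```

Under these assumptions, foobar~0.5 asks for up to 3 edits on a six-letter term, which is outside the automaton's range of 2, while foobar~2 stays inside it. That is one way to see why the length-independent default of 0.5f is a problem and why a raw edit count makes for a more predictable default.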