Return-Path: Delivered-To: apmail-lucene-java-dev-archive@www.apache.org Received: (qmail 57162 invoked from network); 10 Feb 2010 16:38:01 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 10 Feb 2010 16:38:01 -0000 Received: (qmail 4493 invoked by uid 500); 10 Feb 2010 16:38:00 -0000 Delivered-To: apmail-lucene-java-dev-archive@lucene.apache.org Received: (qmail 4411 invoked by uid 500); 10 Feb 2010 16:38:00 -0000 Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-dev@lucene.apache.org Delivered-To: mailing list java-dev@lucene.apache.org Received: (qmail 4403 invoked by uid 99); 10 Feb 2010 16:38:00 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 10 Feb 2010 16:38:00 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 10 Feb 2010 16:37:49 +0000 Received: from brutus.apache.org (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 1065C234C4AD for ; Wed, 10 Feb 2010 08:37:28 -0800 (PST) Message-ID: <1819888957.182721265819848065.JavaMail.jira@brutus.apache.org> Date: Wed, 10 Feb 2010 16:37:28 +0000 (UTC) From: "Uwe Schindler (JIRA)" To: java-dev@lucene.apache.org Subject: [jira] Commented: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times. In-Reply-To: <385144480.402931264105194406.JavaMail.jira@brutus.apache.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832055#action_12832055 ] Uwe Schindler commented on LUCENE-2230: --------------------------------------- Hi Fuad, Ok thanks for the explanation about the cache. But there should still be some eviction algorithm or at least a WeakHashmap so the JVM can cleanup the cache for unused fields. My problem with IndexReaders missing in the cache logic was that if you reopen the index, the BKTree contains terms no longer available and the FuzzyTermEnum enumerates terms that are no longer available. This is bad parctise and should not be done in query rewrite. So enumerating terms with no relation to a real term dict is not really supported by MultiTermQuery, but works for fuzzy, because it does not use the low level segment-based term enumeration and linkage to TermDocs. The new LUCENE-2258 issue needs no warmup, as it uses a different algorithm for the Levenstein algorithm and also does not scan the whole term dict. By the way, in flex also RegEx queries and Wildcard queries are speed up. The problem with trunk not having this automaton package used for that is the problem, that the AutomatonTermsEnum used for that needs to do lots of seeking in the TermsEnum, which is improved in flex and we do not want to do the work twice. Flex has a completely different API on the enum side, so TermEnumerations work different and so on. > Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times. > ---------------------------------------------------------------- > > Key: LUCENE-2230 > URL: https://issues.apache.org/jira/browse/LUCENE-2230 > Project: Lucene - Java > Issue Type: Improvement > Affects Versions: 3.0 > Environment: Lucene currently uses brute force full-terms scanner and calculates distance for each term. New BKTree structure improves performance in average 20 times when distance is 1, and 3 times when distance is 3. I tested with index size several millions docs, and 250,000 terms. > New algo uses integer distances between objects. > Reporter: Fuad Efendi > Attachments: BKTree.java, Distance.java, DistanceImpl.java, FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java > > Original Estimate: 0.02h > Remaining Estimate: 0.02h > > W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973 > http://portal.acm.org/citation.cfm?doid=362003.362025 > I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google). > Additionally, simplified algorythm at http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more logically correct than Levenstein distance, and it is 3-5 times faster (isolated tests). > Big list od distance implementations: > http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org For additional commands, e-mail: java-dev-help@lucene.apache.org