lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dirk Goldhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
Date Fri, 05 Feb 2010 16:47:28 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830160#action_12830160
] 

Dirk Goldhan commented on LUCENE-2230:
--------------------------------------

I have tried your patch on lucene 3.0.0. I had to make a small change to get it work. 
In the current implementation the enumerator is before the first element. This is no problem
in lucene 2.9.1 as it simply does one more iteration in FuzzyQuery (rewrite). In lucene 3.0.0
FuzzyQuery directly accesses the first element of the enumeration. As this is null it simply
stops further processing of terms (if (t == null) break;). 

I have made a small change in the FuzzyTermEnumNEW class to assign the first element to the
enumerator during creation. I simply appended the following lines within the constructor:

		if (this.termIterator.hasNext()) {
			Term firstTerm = new Term(field, termMap.keySet().iterator().next());
			this.currentTerm = firstTerm;
		} 

Thank you for the patch.


> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2230
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2230
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.0
>         Environment: Lucene currently uses brute force full-terms scanner and calculates
distance for each term. New BKTree structure improves performance in average 20 times when
distance is 1, and 3 times when distance is 3. I tested with index size several millions docs,
and 250,000 terms. 
> New algo uses integer distances between objects.
>            Reporter: Fuad Efendi
>         Attachments: BKTree.java, Distance.java, DistanceImpl.java, FuzzyTermEnumNEW.java,
FuzzyTermEnumNEW.java
>
>   Original Estimate: 0.02h
>  Remaining Estimate: 0.02h
>
> W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973
> http://portal.acm.org/citation.cfm?doid=362003.362025
> I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
(Nick Johnson, Google).
> Additionally, simplified algorythm at http://www.catalysoft.com/articles/StrikeAMatch.html
seems to be much more logically correct than Levenstein distance, and it is 3-5 times faster
(isolated tests).
> Big list od distance implementations:
> http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message