lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Fuad Efendi (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
Date Wed, 10 Feb 2010 17:56:32 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832096#action_12832096
] 

Fuad Efendi edited comment on LUCENE-2230 at 2/10/10 5:56 PM:
--------------------------------------------------------------

Hi Uwe,


I am trying to study LUCENE-2258 right now...

bq. BKTree contains terms no longer available

BKTree contains objects, not terms; in my sample it contains Strings, new BKTree<String>(new
Distance()). It is a structure for fast lookup of close objects from a set of objects, with
predefined distance algorithm.

It won't hurt if String appears in BKTree structure, and corresponding Term disappeared from
Index; search results will be the same. Simply, search for <DisappearedTerm> OR <AnotherTerm>
is the same as search for <AnotherTerm>.
At least, we can run background thread which will create new BKTree instance, without hurting
end users.

Yes, Term<->String is another thing to do... I recreate fake terms in TermEnum...



BKTree allows to iterate about 5-10% of whole structure in order to find closest matches only
if distance threshold is small, 2. If it is 4, almost no any improvement. And, classic Levenshtein
distance is slow...

      was (Author: funtick):
    Hi Uwe,


I am trying to study Lucene-2258 right now...

bq. BKTree contains terms no longer available

BKTree contains objects, not terms; in my sample it contains Strings, new BKTree<String>(new
Distance()). It is a structure for fast lookup of close objects from a set of objects, with
predefined distance algorithm.

It won't hurt if String appears in BKTree structure, and corresponding Term disappeared from
Index; search results will be the same. Simply, search for <DisappearedTerm> OR <AnotherTerm>
is the same as search for <AnotherTerm>.
At least, we can run background thread which will create new BKTree instance, without hurting
end users.

Yes, Term<->String is another thing to do... I recreate fake terms in TermEnum...



BKTree allows to iterate about 5-10% of whole structure in order to find closest matches only
if distance threshold is small, 2. If it is 4, almost no any improvement. And, classic Levenshtein
distance is slow...
  
> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2230
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2230
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.0
>         Environment: Lucene currently uses brute force full-terms scanner and calculates
distance for each term. New BKTree structure improves performance in average 20 times when
distance is 1, and 3 times when distance is 3. I tested with index size several millions docs,
and 250,000 terms. 
> New algo uses integer distances between objects.
>            Reporter: Fuad Efendi
>         Attachments: BKTree.java, Distance.java, DistanceImpl.java, FuzzyTermEnumNEW.java,
FuzzyTermEnumNEW.java
>
>   Original Estimate: 0.02h
>  Remaining Estimate: 0.02h
>
> W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973
> http://portal.acm.org/citation.cfm?doid=362003.362025
> I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
(Nick Johnson, Google).
> Additionally, simplified algorythm at http://www.catalysoft.com/articles/StrikeAMatch.html
seems to be much more logically correct than Levenstein distance, and it is 3-5 times faster
(isolated tests).
> Big list od distance implementations:
> http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message