lucene-dev mailing list archives

From "Fuad Efendi (JIRA)" <>
Subject [jira] Commented: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
Date Wed, 10 Feb 2010 15:29:28 GMT


Fuad Efendi commented on LUCENE-2230:

Hi Uwe,

Thanks for the analysis! I spent only a few days on this basic PoC.

I also need to use IndexReader (index version number, etc.) to rewarm the cache. If a term
disappears from the index, we can still leave it in the BKTree (not a problem; it can't be
removed anyway!), and if a new term appears, we simply need to call
{code}public void add(E term){code}
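
For readers unfamiliar with the structure, here is a minimal sketch of how a BK-tree with such an add method might look. It is not the code attached to this issue: the class name BKTreeSketch, the search method, and the use of String instead of a generic E are assumptions made for illustration, and the distance is a plain two-row Levenshtein implementation. Each child is stored under its integer distance to the parent, which lets a lookup skip subtrees that the triangle inequality rules out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative BK-tree over strings (hypothetical; not the LUCENE-2230 patch). */
class BKTreeSketch {
    private static final class Node {
        final String term;
        // Children keyed by their integer distance to this node's term.
        final Map<Integer, Node> children = new HashMap<>();
        Node(String term) { this.term = term; }
    }

    private Node root;

    /** Inserts a term; a term already present (distance 0) is ignored. */
    public void add(String term) {
        if (root == null) { root = new Node(term); return; }
        Node node = root;
        while (true) {
            int d = levenshtein(term, node.term);
            if (d == 0) return;
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(term)); return; }
            node = child;
        }
    }

    /** Returns all stored terms within maxDistance edits of the query. */
    public List<String> search(String query, int maxDistance) {
        List<String> hits = new ArrayList<>();
        if (root != null) collect(root, query, maxDistance, hits);
        return hits;
    }

    private static void collect(Node node, String query, int max, List<String> hits) {
        int d = levenshtein(query, node.term);
        if (d <= max) hits.add(node.term);
        // Triangle inequality: only children keyed in [d - max, d + max] can match.
        for (int k = Math.max(1, d - max); k <= d + max; k++) {
            Node child = node.children.get(k);
            if (child != null) collect(child, query, max, hits);
        }
    }

    /** Classic Levenshtein distance using two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Note that this sketch has no remove operation, which matches the point above that stale terms simply stay in the tree. It is also not thread-safe; concurrent add and search calls would need external synchronization.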

Synchronization should be significantly improved...

Cache warming takes 10-15 seconds in my environment with about 250k tokens, and I use a TreeSet
internally for fast lookup. I also believe that the main performance issue is related to the
Levenshtein algorithm, which is significantly improved in trunk; in addition, synchronization
has been removed from FuzzySearch:

Regarding memory requirements: BKTree is not heavy... I should use 
- it's already in memory... and FuzzyTermEnum uses almost the same amount of memory for processing
as BKTree. I'll check FieldCache.

The BKTree approach can be significantly improved.

> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> ----------------------------------------------------------------
>                 Key: LUCENE-2230
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.0
>         Environment: Lucene currently uses a brute-force scanner over all terms and calculates
the distance for each term. The new BKTree structure improves performance by a factor of 20 on
average when the distance is 1, and by a factor of 3 when the distance is 3. I tested with an
index of several million docs and 250,000 terms.
> The new algorithm uses integer distances between objects.
>            Reporter: Fuad Efendi
>         Attachments:,,,,
>   Original Estimate: 0.02h
>  Remaining Estimate: 0.02h
> W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973
> I was inspired by
(Nick Johnson, Google).
> Additionally, simplified algorithm at
seems to be much more logically correct than Levenshtein distance, and it is 3-5 times faster
(isolated tests).
> Big list of distance implementations:

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
