lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
Date Wed, 10 Feb 2010 14:24:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832001#action_12832001 ]

Uwe Schindler edited comment on LUCENE-2230 at 2/10/10 2:22 PM:
----------------------------------------------------------------

Hi Fuad,

thanks for submitting your changed FuzzyQuery. After a quick look through the classes I
found the following problems:

- The cache is incorrectly synchronized: the cache is static, but access is synchronized on
the instance.
- The cache is unbounded; maybe it should be a WeakHashMap. It can easily exhaust memory
(it consumes a lot of it).
- When you build the tree, you use a class from the spellchecker: org.apache.lucene.search.spell.LuceneDictionary.
This adds further memory consumption, especially if the index has a large term dictionary. Why not
simply iterate over the IndexReader's TermEnum?
- The cache cannot work correctly with per-segment search (since 2.9) or reopened IndexReaders,
because it is bound only to the field name and not to an index reader. For a correct cache,
do it like FieldCache and use a combined key of field name and IndexReader.getFieldCacheKey().
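
The first, second and fourth points above could be addressed roughly as in the sketch below.
It is only an illustration, not code from the patch: the class name PerReaderFieldCache and
its get method are hypothetical, and a plain Object stands in for the key that
IndexReader.getFieldCacheKey() would supply. The outer map is a WeakHashMap keyed by the
reader key, the inner map is keyed by field name, and all access locks on the shared map
rather than on a query instance.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Supplier;

/**
 * Hypothetical cache of per-field structures (e.g. a BK-tree), keyed by
 * reader key and field name. One shared instance can live in a static field.
 */
final class PerReaderFieldCache<V> {
    // Reader keys are held weakly, so an entry can be collected once nothing
    // else references its reader key. Inner map: field name -> cached value.
    // Assumption: cached values do not strongly reference the reader key.
    private final Map<Object, Map<String, V>> cache = new WeakHashMap<>();

    /** Returns the value cached for (readerKey, field), building it on first use. */
    V get(Object readerKey, String field, Supplier<V> builder) {
        // Lock on the shared map so every caller contends on the same monitor.
        synchronized (cache) {
            Map<String, V> perField =
                cache.computeIfAbsent(readerKey, k -> new HashMap<>());
            return perField.computeIfAbsent(field, f -> builder.get());
        }
    }
}
```

With this layout a reopened reader or a separate segment reader gets its own entry, so a
tree built for one reader is never served for another.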

Otherwise it looks like a good approach, but the memory consumption is immense for large term dictionaries.
We are currently developing a DFA-based FuzzyQuery, which will be provided when the new flex branch
is released: LUCENE-2089

If you fix the problems in your classes, we can add this patch to contrib.

  
> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2230
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2230
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.0
>         Environment: Lucene currently uses a brute-force scanner over all terms and calculates
the distance for each term. The new BKTree structure improves performance on average 20 times when
the distance is 1, and 3 times when the distance is 3. I tested with an index of several million docs
and 250,000 terms.
> The new algorithm uses integer distances between objects.
>            Reporter: Fuad Efendi
>         Attachments: BKTree.java, Distance.java, DistanceImpl.java, FuzzyTermEnumNEW.java,
FuzzyTermEnumNEW.java
>
>   Original Estimate: 0.02h
>  Remaining Estimate: 0.02h
>
> W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973
> http://portal.acm.org/citation.cfm?doid=362003.362025
> I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
(Nick Johnson, Google).
> Additionally, the simplified algorithm at http://www.catalysoft.com/articles/StrikeAMatch.html
seems to be much more logically correct than Levenshtein distance, and it is 3-5 times faster
(isolated tests).
> Big list of distance implementations:
> http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm
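
For readers unfamiliar with the structure discussed in this issue, here is a minimal BK-tree
sketch over an integer distance. It is not the attached BKTree.java: the class and method
names are chosen for illustration only, and plain Levenshtein distance is assumed as the
metric. Each node stores its children keyed by their distance to that node, and a search with
tolerance maxDist only descends into children whose key lies within maxDist of the query's
distance to the node, which is what lets it skip most terms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal BK-tree over strings; assumes Levenshtein distance as the metric. */
final class BKTree {
    private static final class Node {
        final String term;
        // Children keyed by their edit distance to this node's term.
        final Map<Integer, Node> children = new HashMap<>();
        Node(String term) { this.term = term; }
    }

    private Node root;

    /** Inserts a term; a term already present (distance 0) is ignored. */
    void add(String term) {
        if (root == null) { root = new Node(term); return; }
        Node node = root;
        while (true) {
            int d = distance(term, node.term);
            if (d == 0) return;
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(term)); return; }
            node = child;
        }
    }

    /** Returns all stored terms within maxDist edits of the query. */
    List<String> search(String query, int maxDist) {
        List<String> hits = new ArrayList<>();
        if (root != null) collect(root, query, maxDist, hits);
        return hits;
    }

    private static void collect(Node node, String query, int maxDist, List<String> hits) {
        int d = distance(query, node.term);
        if (d <= maxDist) hits.add(node.term);
        // Triangle inequality: matches can only sit under keys in [d - maxDist, d + maxDist].
        for (int k = Math.max(1, d - maxDist); k <= d + maxDist; k++) {
            Node child = node.children.get(k);
            if (child != null) collect(child, query, maxDist, hits);
        }
    }

    /** Two-row dynamic-programming Levenshtein distance. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }
}
```

The tree keeps one node per distinct term, which fits the memory concern raised in the
comment above for large term dictionaries.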
