lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Fuzzy Searching on Lucene / Solr
Date Wed, 14 Aug 2013 14:09:12 GMT
Hi Michael,

It is also a size constraint! The FSA would be horrible huge.
FYI: the different fuzzy distances are not implemented by a simple "parameter" to some algorithm.
For every fuzzy distance there is a separate(automatically generated) Java class with huge
FSA matrices that handles the fuzzy matching. The classes for distances > 2 are huge (up
to megabytes of Java code) and slow.

If you want fuzzy matching for distances > 2 use SlowFuzzyQuery from the "queries" lucene
module, see http://lucene.apache.org/core/4_4_0/sandbox/org/apache/lucene/sandbox/queries/SlowFuzzyQuery.html
This one does traditional term dictionary scanning and is not very fast (in fact it will scan
the whole term dictionary from start to end to find matching terms).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Jack Krupansky [mailto:jack@basetechnology.com]
> Sent: Wednesday, August 14, 2013 3:58 PM
> To: java-user@lucene.apache.org
> Subject: Re: Fuzzy Searching on Lucene / Solr
> 
> The limit of 2 is hard-coded precisely because good performance for editing
> distances above 2 cannot be guaranteed.
> 
> -- Jack Krupansky
> 
> -----Original Message-----
> From: Michael Tobias
> Sent: Wednesday, August 14, 2013 1:00 AM
> To: java-user@lucene.apache.org
> Subject: Fuzzy Searching on Lucene / Solr
> 
> My first post so please be gentle with me.
> 
> I am about to start 'playing' with Solr to see if it will be the correct tool for a
> new searchable database development.  One of my requirements is the
> ability to do 'fuzzy' searches and I understand that the latest versions of
> Lucene / Solr use an improved version of indexing and the Levenshtein
> distance formula (or rather a modified version of Levenshtein if wished for,
> treating letter transpositions as a single difference rather than 2).
> 
> Levenshtein is precisely what I need, but I also understand that the
> maximum distance currently implemented is a distance of just TWO.  That is
> not really adequate for my purposes.  I need to be able to handle at least a
> distance of 3 and probably 4.
> 
> Is the current maximum distance of 2 hard-coded in the system?  Can it be
> over-ridden?  How?
> 
> I understand that performance (both indexing and querying) may be
> impaired significantly by doing this but that might be a price worth paying.  If
> it IS possible to change the max distance to 3 or 4 does anybody have any
> idea what the performance implications might be?
> 
> Many thanks for any/all assistance you can provide.
> 
> Regards
> 
> Michael
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message