lucene-dev mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Fuzzy query with Jaro-Winkler distance
Date Wed, 21 Apr 2004 16:03:48 GMT
In fact, if you make this clean and pluggable enough, it seems 
reasonable to make this type of change to the core (with the default 
being the current Levenshtein distance formula, of course).

Perhaps the formula should simply bounce through Similarity somehow so 
that the computation can be centralized there (passing the field name, 
to key off that if you like)?

	Erik


On Apr 21, 2004, at 11:08 AM, Robert Engels wrote:

> Just define an interface for distance calculation and create a
> FuzzyQueryRegistry that lets you register a 'distance calculation'
> implementation for a field name. Then check the registry in your
> FuzzyTermEnum constructor to obtain a reference to the calculation
> implementation you should use.
>
> If there is no calculation registered for the field name, just return
> the default calculation.
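
A minimal sketch of the registry idea described above, in present-day Java rather than the Java of 2004. None of these types exist in Lucene: TermDistance, LevenshteinDistance, sharedPrefixSimilarity and the registry methods are hypothetical names chosen for illustration, and returning a similarity in [0, 1] is an assumption. The only behaviour taken from the thread is the per-field lookup with a fallback to a Levenshtein default.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FuzzyRegistrySketch {

    /** Hypothetical interface: similarity in [0, 1], where 1 means identical. */
    interface TermDistance {
        float similarity(String target, String candidate);
    }

    /** Default: Levenshtein edit distance scaled by the longer string's length. */
    static final class LevenshteinDistance implements TermDistance {
        @Override
        public float similarity(String a, String b) {
            int maxLen = Math.max(a.length(), b.length());
            if (maxLen == 0) return 1.0f;
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
            }
            return 1.0f - (float) prev[b.length()] / maxLen;
        }
    }

    /** Maps a field name to its distance function; unknown fields get the default. */
    static final class FuzzyQueryRegistry {
        private static final TermDistance DEFAULT = new LevenshteinDistance();
        private static final Map<String, TermDistance> BY_FIELD = new ConcurrentHashMap<>();

        static void register(String field, TermDistance distance) {
            BY_FIELD.put(field, distance);
        }

        static TermDistance lookup(String field) {
            return BY_FIELD.getOrDefault(field, DEFAULT);
        }
    }

    /** Illustrative field-specific measure: fraction of shared leading characters. */
    static float sharedPrefixSimilarity(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int same = 0;
        while (same < n && a.charAt(same) == b.charAt(same)) same++;
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0f : (float) same / maxLen;
    }

    public static void main(String[] args) {
        FuzzyQueryRegistry.register("zip", FuzzyRegistrySketch::sharedPrefixSimilarity);
        // A term enumerator would perform this lookup once, in its constructor.
        System.out.println(FuzzyQueryRegistry.lookup("zip").similarity("90210", "90211"));
        System.out.println(FuzzyQueryRegistry.lookup("title").similarity("kitten", "sitting"));
    }
}
```

A static registry keeps the query parser untouched, which is the point of the proposal, but it is still global state. Routing the lookup through a per-searcher object would avoid that.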
>
> -----Original Message-----
> From: eks dev [mailto:eksdev@yahoo.co.uk]
> Sent: Wednesday, April 21, 2004 9:58 AM
> To: lucene-dev@jakarta.apache.org
> Subject: RE: Fuzzy query with Jaro-Winkler distance
>
>
> Thanks, it looks easy and probably good enough for the
> first cut.
> I am only hesitating because this approach allows only
> one type of fuzzy query at a time (global),
> irrespective of the field/Token type. The direction I
> am thinking in is more like having one fuzzy query (no
> new tokens for the parser...) that dynamically selects the
> distance function based on the Field name.
>
> The argument for this approach is not difficult to
> make, e.g. there are some nice heuristics for fuzzy
> date comparison, as well as for zip codes, first names,
> emails (if the domain is the same and the rest is
> slightly different…), for gene matching (LCS)...
> The trick is that all of these have different
> comparison functions.
>
> I guess the decision about which distance measure to use
> can be made based on the Field name (type?).
>
> As for your comment about the linear scan of all tokens: yes, this
> hurts big time. The typical way to deal with it is to
> limit the scan to tokens that begin with a prefix of the
> search term (the prefix length depends on the length of
> the search term). You can also skip tokens whose
> length differs too much (way faster than calculating
> the distance). Something like that is a must if you
> have a large token set. But this optimisation is a
> compromise; I will think about it later.
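
The two cheap filters described above can be sketched as follows. This is an illustration over a plain list of strings, not Lucene code: the class, the method names and the prefix-length thresholds are all assumptions made up for the example, and the thread does not specify any particular values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FuzzyPrefilterSketch {

    /** Longer query terms get a longer required prefix (thresholds are arbitrary). */
    static int prefixLength(String term) {
        int wanted = term.length() <= 3 ? 1 : term.length() <= 7 ? 2 : 3;
        return Math.min(wanted, term.length()); // never exceed the term itself
    }

    /** Keeps only the terms worth passing to the expensive distance function. */
    static List<String> candidates(List<String> terms, String query, int maxLengthDiff) {
        String prefix = query.substring(0, prefixLength(query));
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            if (!term.startsWith(prefix)) continue;                              // prefix filter
            if (Math.abs(term.length() - query.length()) > maxLengthDiff) continue; // length filter
            kept.add(term);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
            "apache", "lucene", "lucenesearchengine", "lucent", "lucid", "luxury");
        System.out.println(candidates(terms, "lucene", 2));
    }
}
```

This is a compromise in exactly the sense noted above: a term whose only difference from the query falls inside the required prefix is never examined. Because index terms are stored in sorted order, a real implementation could presumably seek straight to the prefix and stop at the first term past it, instead of filtering every term as this sketch does.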
>
> I am not yet comfortable with the Lucene code and
> would highly appreciate your comments, especially if I
> am missing the point.
>
> Cheers, Eks
>
>
>  --- Robert Engels <rengels@ix.netcom.com> wrote:
>> I think it is as simple as 'modifying'
>> FuzzyQuery.java, checking an 'environment
>> variable', and then instantiating either a
>> FuzzyTermEnum() or a
>> SpecialFuzzyTermEnum().
>>
>> All of the logic of 'FuzzyTerm' is contained in
>> FuzzyTermEnum.java. If it is
>> nothing more than a different difference
>> calculation, then just check the
>> environment variable in FuzzyTermEnum.java, and call
>> the appropriate distance
>> calculation routine.
>>
>> If you create a new 'Query' class, then you have to
>> modify the expression
>> language to add a new 'term' character, which could
>> get messy eventually
>> (run out of characters).
>>
>> The only issue with the current FuzzyTerm
>> implementation is that it requires
>> a full linear search of all of the terms in the index.
>>
>> Robert
>>
>> -----Original Message-----
>> From: Erik Hatcher
>> [mailto:erik@ehatchersolutions.com]
>> Sent: Tuesday, April 20, 2004 4:44 AM
>> To: Lucene Developers List
>> Subject: Re: Fuzzy query with Jaro-Winkler distance
>>
>>
>> On Apr 20, 2004, at 5:11 AM, eks dev wrote:
>>> Hi All,
>>>    I would like to use Fuzzy Query with another
>>> type(s)  of string distance.
>>
>> You will have to write your own Query (probably
>> subclass
>> MultiTermQuery) to do this.  The FuzzyQuery
>> calculations are buried
>> deep and not customizable (at least not currently).
>>
>> 	Erik
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

