lucene-java-user mailing list archives

From "Jim Hargrave" <Hargrav...@ldschurch.org>
Subject Re: String similarity search vs. typical IR application...
Date Thu, 05 Jun 2003 22:25:38 GMT
Probably shouldn't have added that last bit. Our app isn't a DNA searcher. But DASG+Lev does
look interesting.
 
Our app is a linguistic application. We want to search for sentences that have many n-grams
in common and rank them by the score described below. It is similar to the TELLTALE system
(search Google for TELLTALE + ngrams), but we are not interested in IR per se: we want to
compute a score based on pure string similarity. Sentences are docs, n-grams are terms.
 
Jim

>>> Leo.G@seznam.cz 06/05/03 03:55PM >>>
AFAIK Lucene is not able to look DNA strings up effectively. You would 
use DASG+Lev (see my previous post - 05/30/2003 1916CEST).

-g-

Jim Hargrave wrote:

>Our application is a string similarity searcher where the query is an input string and
we want to find all "fuzzy" variants of the input string in the DB.  The score is basically
Dice's coefficient: 2C/(Q+D), where C is the number of terms (n-grams) in common, Q is the number
of unique query terms and D is the number of unique document terms. Our documents will be
sentences.
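
For illustration, the score described above could be computed directly from two strings as in the sketch below. It assumes character n-grams and plain Java collections; the class and method names (NgramDice, ngrams, dice) are made up for this example and are not part of Lucene.

```java
import java.util.HashSet;
import java.util.Set;

class NgramDice {
    // Unique character n-grams of s; these play the role of "terms".
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Dice's coefficient 2C/(Q+D) over unique n-grams.
    static double dice(String query, String doc, int n) {
        Set<String> q = ngrams(query, n);   // Q = q.size()
        Set<String> d = ngrams(doc, n);     // D = d.size()
        if (q.isEmpty() && d.isEmpty()) {
            return 0.0;                     // assumption: no terms means no similarity
        }
        Set<String> common = new HashSet<>(q);
        common.retainAll(d);                // C = common.size()
        return 2.0 * common.size() / (q.size() + d.size());
    }

    public static void main(String[] args) {
        // "night" and "nacht" share only the bigram "ht": 2*1/(4+4)
        System.out.println(dice("night", "nacht", 2)); // prints 0.25
    }
}
```

Comparing the query against every sentence this way is linear in the size of the DB, so it only shows the arithmetic, not a practical search strategy.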
> 
>I know Lucene has a fuzzy search capability - but I assume this would be very slow since
it must search through the entire term list to find candidates.
> 
>In order to do the calculation I will need to have 'C' - the number of terms in common
between query and document. Is there an API that I can call to get this info? Any hints on
what it will take to modify Lucene to handle these kinds of queries? 
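
One way to obtain C without scanning every document is to walk the posting lists of the query's n-grams and count, per document, how many lists it appears in. The sketch below does this with a small in-memory inverted index. It is not Lucene's API; NgramIndex and its methods are hypothetical names used only to show the idea.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class NgramIndex {
    private final int n;
    // n-gram -> ids of the sentences that contain it
    private final Map<String, List<Integer>> postings = new HashMap<>();
    // D for each sentence: its number of unique n-grams
    private final List<Integer> docTermCounts = new ArrayList<>();

    NgramIndex(int n) {
        this.n = n;
    }

    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Index one sentence and return its id.
    int add(String sentence) {
        int docId = docTermCounts.size();
        Set<String> grams = ngrams(sentence, n);
        for (String g : grams) {
            postings.computeIfAbsent(g, k -> new ArrayList<>()).add(docId);
        }
        docTermCounts.add(grams.size());
        return docId;
    }

    // Dice score 2C/(Q+D) for every sentence sharing at least one n-gram with the query.
    Map<Integer, Double> score(String query) {
        Set<String> q = ngrams(query, n);
        Map<Integer, Integer> common = new HashMap<>(); // docId -> C
        for (String g : q) {
            for (int docId : postings.getOrDefault(g, Collections.<Integer>emptyList())) {
                common.merge(docId, 1, Integer::sum);
            }
        }
        Map<Integer, Double> scores = new HashMap<>();
        for (Map.Entry<Integer, Integer> e : common.entrySet()) {
            int d = docTermCounts.get(e.getKey());
            scores.put(e.getKey(), 2.0 * e.getValue() / (q.size() + d));
        }
        return scores;
    }
}
```

Only sentences that share at least one n-gram with the query are touched, so the cost depends on the length of the posting lists involved rather than on the size of the whole term list.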
>  
>


