lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jim Hargrave" <Hargrav...@ldschurch.org>
Subject String similarity search vs. typcial IR application...
Date Thu, 05 Jun 2003 21:12:36 GMT
Our application is a string similarity searcher where the query is an input string and we want
to find all "fuzzy" variants of the input string in the DB.  The Score is basically dice's
coefficient: 2C/Q+D, where C is the number of terms (n-grams) in common, Q is the number of
unique query terms and D is the number of unique document terms. Our documents will be sentences.
 
I know Lucene has a fuzzy search capability - but I assume this would be very slow since it
must search through the entire term list to find candidates.
 
In order to do the calculation I will need to have 'C' - the number of terms in common between
query and document. Is there an API that I can call to get this info? Any hints on what it
will take to modify Lucene to handle these kinds of queries? 
 
BTW: 
Ever consider using Lucene for DNA searching? - this technique could also be used to search
large DNA databases.
 
Thanks!
 
Jim Hargrave


------------------------------------------------------------------------------
This message may contain confidential information, and is intended only for the use of the
individual(s) to whom it is addressed.


==============================================================================

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message