Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Message-ID: <sedf6f0a.014@mail2.gmhwh.org>
Date: Thu, 05 Jun 2003 16:25:38 -0600
From: "Jim Hargrave" <HargraveJE@ldschurch.org>
To: lucene-user@jakarta.apache.org
Subject: Re: String similarity search vs. typical IR application...
MIME-Version: 1.0
Content-Type: multipart/alternative;
 boundary="=_306FDC7A.CFAEBD1C"

--=_306FDC7A.CFAEBD1C
Content-Type: text/plain;
 charset=us-ascii
Content-Transfer-Encoding: quoted-printable

Probably shouldn't have added that last bit. Our app isn't a DNA searcher.
But DASG+Lev does look interesting.

Our app is a linguistic application. We want to search for sentences which
have many n-grams in common and rank them based on the score below. It is
similar to the TELLTALE system (do a Google search for TELLTALE + ngrams) -
but we are not interested in IR per se - we want to compute a score based on
pure string similarity. Sentences are docs, n-grams are terms.

Jim
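To make the "sentences are docs, n-grams are terms" setup concrete, here is a minimal sketch in plain Python (not the Lucene API). It assumes character trigrams, lowercasing and whitespace collapsing, none of which are specified above; the names char_ngrams, build_index and shared_counts are hypothetical. It builds an inverted index from n-gram to sentence ids, then walks the postings of each query n-gram to count how many n-grams each candidate sentence shares with the query.

```python
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    """Unique character n-grams of text (assumption: lowercase, single spaces)."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}

def build_index(sentences, n=3):
    """Inverted index: n-gram -> set of sentence ids containing it."""
    index = defaultdict(set)
    for doc_id, sentence in enumerate(sentences):
        for gram in char_ngrams(sentence, n):
            index[gram].add(doc_id)
    return index

def shared_counts(query, index, n=3):
    """For each candidate sentence id, count the n-grams it shares with the query."""
    counts = Counter()
    for gram in char_ngrams(query, n):
        for doc_id in index.get(gram, ()):
            counts[doc_id] += 1
    return counts

sentences = ["the cat sat", "the cat ran", "dogs bark"]
index = build_index(sentences)
counts = shared_counts("the cat sat", index)
# Only sentences sharing at least one n-gram with the query appear in counts.
```

Because only the postings of the query's own n-grams are visited, sentences with nothing in common with the query are never touched.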

>>> Leo.G@seznam.cz 06/05/03 03:55PM >>>
AFAIK Lucene is not able to look DNA strings up effectively. You would
use DASG+Lev (see my previous post - 05/30/2003 1916CEST).

-g-

Jim Hargrave wrote:

>Our application is a string similarity searcher where the query is an input
>string and we want to find all "fuzzy" variants of the input string in the
>DB. The score is basically Dice's coefficient: 2C/(Q+D), where C is the
>number of terms (n-grams) in common, Q is the number of unique query terms
>and D is the number of unique document terms. Our documents will be
>sentences.
>
>I know Lucene has a fuzzy search capability - but I assume this would be
>very slow since it must search through the entire term list to find
>candidates.
>
>In order to do the calculation I will need to have 'C' - the number of
>terms in common between query and document. Is there an API that I can call
>to get this info? Any hints on what it will take to modify Lucene to handle
>these kinds of queries?
>
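The score in the quoted message, Dice's coefficient 2C/(Q+D), can be sketched over two sets of unique terms as follows. This is an illustration in plain Python, not a Lucene API; the function name dice_score and the choice to return 0 when both term sets are empty are assumptions, and the example trigrams are made up for the arithmetic.

```python
def dice_score(query_terms, doc_terms):
    """Dice's coefficient 2C / (Q + D) over unique query and document terms."""
    q, d = set(query_terms), set(doc_terms)
    if not q and not d:
        return 0.0  # assumption: avoid dividing by zero when both sets are empty
    c = len(q & d)  # C: number of terms in common
    return 2.0 * c / (len(q) + len(d))

# Worked example: C = 2 shared trigrams, Q = 5, D = 4, so the score is 4/9.
query = {"the", "he ", "e c", " ca", "cat"}
doc = {" ca", "cat", "at ", "t s"}
score = dice_score(query, doc)
```

Identical term sets score 1, and sets with no terms in common score 0, so the value can be used directly for ranking.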


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org



--=_306FDC7A.CFAEBD1C--