From: "Jim Hargrave"
To: lucene-user@jakarta.apache.org
Date: Thu, 05 Jun 2003 16:25:38 -0600
Subject: Re: String similarity search vs. typcial IR application...

Probably shouldn't have added that last bit. Our app isn't a DNA searcher. But DASG+Lev does look interesting.

Our app is a linguistic application. We want to search for sentences which have many ngrams in common and rank them based on the score below.
Similar to the TELLTALE system (do a Google search for TELLTALE + ngrams) - but we are not interested in IR per se - we want to compute a score based on pure string similarity. Sentences are docs, ngrams are terms.

Jim

>>> Leo.G@seznam.cz 06/05/03 03:55PM >>>
AFAIK Lucene is not able to look DNA strings up effectively. You would use DASG+Lev (see my previous post - 05/30/2003 1916CEST).

-g-

Jim Hargrave wrote:

>Our application is a string similarity searcher where the query is an input string and we want to find all "fuzzy" variants of the input string in the DB. The score is basically Dice's coefficient: 2C/(Q+D), where C is the number of terms (n-grams) in common, Q is the number of unique query terms and D is the number of unique document terms. Our documents will be sentences.
>
>I know Lucene has a fuzzy search capability - but I assume this would be very slow since it must search through the entire term list to find candidates.
>
>In order to do the calculation I will need to have 'C' - the number of terms in common between query and document. Is there an API that I can call to get this info? Any hints on what it will take to modify Lucene to handle these kinds of queries?
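
For illustration only, here is a rough Python sketch of the scoring described in the quoted message: Dice's coefficient 2C/(Q+D) over unique n-grams, with C accumulated from an inverted index so that only sentences sharing at least one n-gram with the query are touched. It is not Lucene code and does not use any Lucene API. The choice of character trigrams, the lowercasing, and the names `ngrams`, `dice` and `NgramIndex` are all assumptions made for the example.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of unique character n-grams in text.

    Character trigrams and lowercasing are assumptions of this sketch;
    the original message does not say how n-grams are formed.
    """
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def dice(query_terms, doc_terms):
    """Dice's coefficient 2C / (Q + D) over two sets of unique terms."""
    if not query_terms and not doc_terms:
        return 0.0
    c = len(query_terms & doc_terms)  # C: terms in common
    return 2.0 * c / (len(query_terms) + len(doc_terms))


class NgramIndex:
    """Hypothetical in-memory inverted index: n-gram -> ids of sentences."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)
        self.doc_sizes = []  # D (unique n-gram count) for each sentence
        self.sentences = []

    def add(self, sentence):
        """Index one sentence and return its id."""
        doc_id = len(self.sentences)
        terms = ngrams(sentence, self.n)
        self.sentences.append(sentence)
        self.doc_sizes.append(len(terms))
        for term in terms:
            self.postings[term].add(doc_id)
        return doc_id

    def search(self, query, min_score=0.0):
        """Return (score, sentence) pairs, best score first."""
        q_terms = ngrams(query, self.n)
        common = defaultdict(int)  # doc_id -> C
        # Walk only the posting lists of the query's n-grams.
        for term in q_terms:
            for doc_id in self.postings.get(term, ()):
                common[doc_id] += 1
        q = len(q_terms)
        hits = [
            (2.0 * c / (q + self.doc_sizes[doc_id]), self.sentences[doc_id])
            for doc_id, c in common.items()
        ]
        hits = [h for h in hits if h[0] >= min_score]
        hits.sort(key=lambda h: (-h[0], h[1]))
        return hits
```

A sentence that shares no n-gram with the query never appears in `common`, so it is never scored. That is the property that avoids a scan of every document, though a very frequent n-gram can still produce a long posting list.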