Subject: RE: String similarity search vs. typcial IR application...
Date: Fri, 6 Jun 2003 08:43:03 -0400
From: "Frank Burough"
To: "Lucene Users List"
Message-ID: <2D2345FCBC9EE94A93A1AE8769D98AC909E782@exchange.elixir-int.com>

I have seen some interesting work done on storing DNA sequence as a set of common patterns with unique sequence between them.
If one uses an analyzer to break sequence into its set of patterns and unique sequence, then Lucene could be used to search for exact pattern matches. I know of only one sequence search tool that was based on this approach. I don't know if it ever left the lab and made it into the mainstream. If I have time I will explore this a bit.

Frank Burough

> -----Original Message-----
> From: Leo Galambos [mailto:Leo.G@seznam.cz]
> Sent: Thursday, June 05, 2003 5:55 PM
> To: Lucene Users List
> Subject: Re: String similarity search vs. typcial IR application...
>
> AFAIK Lucene is not able to look DNA strings up effectively. You would
> use DASG+Lev (see my previous post - 05/30/2003 1916CEST).
>
> -g-
>
> Jim Hargrave wrote:
>
> >Our application is a string similarity searcher where the query is an
> >input string and we want to find all "fuzzy" variants of the input
> >string in the DB. The score is basically Dice's coefficient:
> >2C/(Q+D), where C is the number of terms (n-grams) in common, Q is
> >the number of unique query terms and D is the number of unique
> >document terms. Our documents will be sentences.
> >
> >I know Lucene has a fuzzy search capability - but I assume this would
> >be very slow since it must search through the entire term list to
> >find candidates.
> >
> >In order to do the calculation I will need to have 'C' - the number of
> >terms in common between query and document. Is there an API that I can
> >call to get this info? Any hints on what it will take to modify Lucene
> >to handle these kinds of queries?
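
Note: the score described in the quoted question can be sketched outside Lucene as follows. This is a minimal Python illustration of Dice's coefficient, 2C/(Q+D), over unique character n-grams. The function names char_ngrams and dice_score, the trigram default, and the lowercasing step are assumptions made for this sketch; none of them come from the thread or from Lucene's API, and the sketch does not address how to get C out of a Lucene index.

```python
def char_ngrams(text, n=3):
    """Return the set of unique character n-grams in text.

    Lowercasing is an assumption of this sketch, not something the
    thread specifies. Strings shorter than n yield an empty set.
    """
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dice_score(query, document, n=3):
    """Dice's coefficient 2C / (Q + D) over unique n-grams.

    C = number of n-grams shared by query and document
    Q = number of unique query n-grams
    D = number of unique document n-grams
    """
    q_terms = char_ngrams(query, n)
    d_terms = char_ngrams(document, n)
    if not q_terms and not d_terms:
        # Both inputs too short to produce any n-gram: define the score as 0.
        return 0.0
    common = len(q_terms & d_terms)  # C
    return 2.0 * common / (len(q_terms) + len(d_terms))
```

For example, with n=2 the strings "night" and "nacht" each have four unique bigrams and share only "ht", so dice_score("night", "nacht", n=2) is 2*1/(4+4) = 0.25.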

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org