From: Leo Galambos
Date: Thu, 05 Jun 2003 23:55:20 +0200
To: Lucene Users List
Message-ID: <3EDFBC48.5070203@seznam.cz>
Subject: Re: String similarity search vs. typcial IR application...

AFAIK Lucene is not able to look up DNA strings effectively. You would use DASG+Lev instead (see my previous post, 05/30/2003 19:16 CEST).

-g-

Jim Hargrave wrote:

> Our application is a string similarity searcher: the query is an input string, and we want to find all "fuzzy" variants of that string in the DB. The score is basically Dice's coefficient, 2C/(Q+D), where C is the number of terms (n-grams) in common, Q is the number of unique query terms, and D is the number of unique document terms. Our documents will be sentences.
>
> I know Lucene has a fuzzy search capability, but I assume it would be very slow, since it must search through the entire term list to find candidates.
>
> In order to do the calculation I will need 'C', the number of terms in common between query and document. Is there an API that I can call to get this info? Any hints on what it will take to modify Lucene to handle these kinds of queries?
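
For reference, the Dice score described in the quoted message can be sketched in plain Java, independent of Lucene. This is only an illustration of the formula 2C/(Q+D) over unique character n-grams. The class name DiceNgram, its methods, and the choice of n = 2 in the example are assumptions made here, not anything from Lucene or from either poster. It compares one query against one document by brute force, so it does not address the question of finding candidates efficiently in an index.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not a Lucene API): Dice's coefficient over character n-grams.
final class DiceNgram {

    // Collect the unique character n-grams of s; empty if s is shorter than n.
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // score = 2C / (Q + D), where C = shared n-grams,
    // Q = unique query n-grams, D = unique document n-grams.
    static double dice(String query, String doc, int n) {
        Set<String> q = ngrams(query, n);
        Set<String> d = ngrams(doc, n);
        if (q.isEmpty() && d.isEmpty()) {
            return 0.0; // assumption: define the score as 0 when neither side has n-grams
        }
        Set<String> common = new HashSet<>(q);
        common.retainAll(d); // set intersection gives C
        return 2.0 * common.size() / (q.size() + d.size());
    }

    public static void main(String[] args) {
        // "night" -> {ni, ig, gh, ht}, "nacht" -> {na, ac, ch, ht}; C = 1, Q = D = 4
        System.out.println(dice("night", "nacht", 2)); // prints 0.25
    }
}
```

Using sets rather than multisets matches the "unique terms" wording in the quoted message; repeated n-grams in a sentence count once.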