Message-ID: <3EDFE288.4020407@seznam.cz>
Date: Fri, 06 Jun 2003 02:38:32 +0200
From: Leo Galambos
To: Lucene Users List
Subject: Re: String similarity search vs. typcial IR application...

I see.
Are you looking for this:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html

On the other hand, if n is not fixed, you still have a problem. As far as I can tell from reading this list, Lucene reads a dictionary (of terms) into memory, and it also allocates one file handle for each of the acting terms. This implies that you would not break the terms up into n-grams and, as a result, you would use a slow look-up over the dictionary. I do not know if I am expressing it correctly, but my personal feeling is that you would be better off writing your application from scratch.

BTW: If you have "nice terms", you could find all occurrences of their n-grams in the dictionary and compute a boost factor for each of the inverted lists. E.g., if "bbc" is a term in a query, then for the i-list of "abba" the factor is 1 (the bigram "bb" occurs once), and for the i-list of "bbb" the factor is 2 ("bb" occurs twice). Then you use the Similarity class, and it is solved. Nevertheless, if the n-grams are not nice and the query is long, you will lose a lot of time in the dictionary look-up phase.

-g-

PS: I'm sorry for my English, just learning...

Jim Hargrave wrote:

>Probably shouldn't have added that last bit. Our app isn't a DNA searcher. But DASG+Lev does look interesting.
>
>Our app is a linguistic application. We want to search for sentences which have many ngrams in common and rank them based on the score below. Similar to the TELLTALE system (do a google search TELLTALE + ngrams) - but we are not interested in IR per se - we want to compute a score based on pure string similarity. Sentences are docs, ngrams are terms.
>
>Jim
>
>>>> Leo.G@seznam.cz 06/05/03 03:55PM >>>
>
>AFAIK Lucene is not able to look DNA strings up effectively. You would
>use DASG+Lev (see my previous post - 05/30/2003 1916CEST).
>
>-g-
>
>Jim Hargrave wrote:
>
>>Our application is a string similarity searcher where the query is an input string and we want to find all "fuzzy" variants of the input string in the DB.
>>The Score is basically Dice's coefficient: 2C/(Q+D), where C is the number of terms (n-grams) in common, Q is the number of unique query terms and D is the number of unique document terms. Our documents will be sentences.
>>
>>I know Lucene has a fuzzy search capability - but I assume this would be very slow since it must search through the entire term list to find candidates.
>>
>>In order to do the calculation I will need to have 'C' - the number of terms in common between query and document. Is there an API that I can call to get this info? Any hints on what it will take to modify Lucene to handle these kinds of queries?
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>This message may contain confidential information, and is intended only for the use of the individual(s) to whom it is addressed.
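
For illustration only, the Dice score from the quoted question, 2C/(Q+D), and the bigram boost factor from the reply might be sketched as below. This is plain Python and does not use any Lucene API. The function names (ngrams, dice, boost_factor) are hypothetical, and character bigrams are an assumption, since the thread does not say how the n-grams are built.

```python
def ngrams(text, n=2):
    """Return the character n-grams of text, in order, repeats included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def dice(query, doc, n=2):
    """Dice's coefficient 2C/(Q+D) over unique n-grams.

    C = unique n-grams shared by query and doc,
    Q = unique n-grams in the query, D = unique n-grams in the doc.
    """
    q = set(ngrams(query, n))
    d = set(ngrams(doc, n))
    if not q and not d:
        return 0.0  # assumption: two inputs with no n-grams score 0
    return 2 * len(q & d) / (len(q) + len(d))


def boost_factor(query_term, dict_term, n=2):
    """Count how often n-grams of query_term occur in dict_term."""
    wanted = set(ngrams(query_term, n))
    return sum(1 for g in ngrams(dict_term, n) if g in wanted)
```

With these definitions, boost_factor("bbc", "abba") gives 1 and boost_factor("bbc", "bbb") gives 2, which matches the "bbc" example in the reply. Note that dice works on unique n-grams while boost_factor counts repeated occurrences, following how each is described in the thread.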