Message-ID: <414089F6.8000804@tropo.com>
Date: Thu, 09 Sep 2004 09:51:02 -0700
From: David Spencer
To: Lucene Users List
Subject: Re: combining open office spellchecker with Lucene
In-Reply-To: <4140818C.9010305@getopt.org>

Andrzej Bialecki wrote:
> David Spencer wrote:
>
>> I can/should send the code out. The logic is that for any terms in a
>> query that have zero matches, go thru all the terms(!) and calculate
>> the Levenshtein string distance, and return the best matches. A more
>> intelligent way of doing this is to instead look for terms that also
>> match on the 1st "n" (prob 3) chars.
>
> ...or prepare in advance a fast lookup index - split all existing terms
> to bi- or trigrams, create a separate lookup index, and then simply for
> each term ask a phrase query (phrase = all n-grams from an input term),
> with a slop > 0, to get similar existing terms. This should be fast, and
> you could provide a "did you mean" function too...

Sounds interesting/fun, but I'm not sure I'm following exactly.

Let's talk through the trigram index case. Are you saying that for every trigram in every word there will be a mapping of trigram -> term? Thus if "recursive" is in the (original) index then we'd create entries like:

rec -> recursive
ecu -> ...
cur -> ...
urs -> ...
rsi -> ...
siv -> ...
ive -> ...

And so on for all terms in the original index. OK, fine. But now the user types in a query like "recursivz".
What's the algorithm? Obviously, I guess, take all trigrams in the bad term and look each one up in the trigram index, but that will return lots of suggestions. Now what, use string distance to score them? I guess that makes sense. Please confirm that I understand.

So I guess the point here is that we precalculate the trigram -> term mappings to avoid an expensive traversal of all terms in the index, but we still use string distance as a 2nd pass (and probably should force the matches to always match on the 1st n (3) chars, using the heuristic that people usually start spelling a word correctly).
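
To make the two-pass idea concrete, here is a rough sketch in plain Java. It is only an illustration under stated assumptions: it uses an in-memory map rather than a separate Lucene index or a sloppy phrase query, and the class name TrigramSuggester and its methods are made up for this example, not part of Lucene. Pass 1 collects every term that shares at least one trigram with the misspelled word; pass 2 keeps only candidates that match on the first n chars and ranks them by Levenshtein distance.

```java
import java.util.*;

/** Hypothetical helper (not a Lucene class): trigram lookup plus edit-distance re-ranking. */
public class TrigramSuggester {
    private static final int N = 3; // n-gram size; 3 = trigrams

    // trigram -> every index term that contains it, e.g. "rec" -> {"recursive", ...}
    private final Map<String, Set<String>> gramToTerms = new HashMap<>();

    /** Precalculate the trigram -> term mappings for all terms of the original index. */
    public TrigramSuggester(Collection<String> terms) {
        for (String term : terms) {
            for (String gram : grams(term)) {
                gramToTerms.computeIfAbsent(gram, g -> new HashSet<>()).add(term);
            }
        }
    }

    /** All overlapping n-grams of s; empty if s is shorter than N. */
    static List<String> grams(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + N <= s.length(); i++) {
            out.add(s.substring(i, i + N));
        }
        return out;
    }

    /** Classic dynamic-programming Levenshtein distance using two rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    /** Suggest up to max terms for a misspelled word, best match first. */
    public List<String> suggest(String word, int prefixLen, int max) {
        // Pass 1: union of all terms sharing at least one trigram with the word.
        Set<String> candidates = new HashSet<>();
        for (String gram : grams(word)) {
            candidates.addAll(gramToTerms.getOrDefault(gram, Collections.emptySet()));
        }
        // Pass 2: enforce the "first n chars are right" heuristic, then rank by distance
        // (ties broken alphabetically so the output is deterministic).
        String prefix = word.substring(0, Math.min(prefixLen, word.length()));
        List<String> ranked = new ArrayList<>();
        for (String c : candidates) {
            if (c.startsWith(prefix)) ranked.add(c);
        }
        ranked.sort(Comparator.comparingInt((String c) -> levenshtein(word, c))
                .thenComparing(Comparator.<String>naturalOrder()));
        return new ArrayList<>(ranked.subList(0, Math.min(max, ranked.size())));
    }
}
```

With terms such as "recursive", "recursion", "cursive" and "receive", a call like suggest("recursivz", 3, 3) would drop "cursive" in the prefix check and rank "recursive" first. The prefix filter is the heuristic from above, so it will miss corrections for words misspelled in their first three characters; passing a prefixLen of 0 turns it off.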