Message-ID: <1203792219.1144524088572.JavaMail.jira@ajax>
Date: Sat, 8 Apr 2006 20:21:28 +0100 (BST)
From: "Karl Wettin (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-537) Refactor of spell check
In-Reply-To: <587211284.1143793170438.JavaMail.jira@ajax>
Reply-To: java-dev@lucene.apache.org
Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm

     [ http://issues.apache.org/jira/browse/LUCENE-537?page=all ]

Karl Wettin updated LUCENE-537:
-------------------------------

    Attachment: ngram_spellcheck_karl_v3.tar

This update includes some name changes, small optimizations of logic, and
a fix to the evil bug, mentioned in an earlier comment, that rendered both my version and the SVN version unusable.

It might be worth mentioning that in my derivative of this code I cache all suggestions in a Map. This really speeds things up and does not consume that much RAM.

As a side note, I feel that "suggest only more frequent terms" is not satisfactory. The threshold should be a strategy, and I think there must be a better one than what is available.

I do, however, think this is the final version of my changes to the ngram spell checker. I will start working on a new suggestion scheme based on an A-starred Markov chain that analyses the relation between multiple words, as this ngrammer is really only good at one word at a time. Perhaps it can be a base for the new one. Levenshtein is more compelling to me.

> Refactor of spell check
> -----------------------
>
>          Key: LUCENE-537
>          URL: http://issues.apache.org/jira/browse/LUCENE-537
>      Project: Lucene - Java
>         Type: Improvement
>     Reporter: Karl Wettin
>  Attachments: lucene_spellcheck.tar.gz, ngram_spellcheck_karl_v3.tar
>
> I use the same ngram index for multiple categories, but only want to spell check per category. The old implementation did not support this as it used docFreq as controller source.
> The spell check returns suggestions with a score and not just the suggested word.
> TokenFrequencyVector replaces the IndexReader used for docFreq.
> LuceneTokenFrequencyVector wraps an IndexReader and works just as the old implementation did.
> LuceneQueryDictionary creates an ngram dictionary based on a query and not the whole index.
> MultiTokenFrequencyVector treats a number of TokenFrequencyVector instances as one, i.e. for use when spell checking in multiple contexts.
> TokenFrequencyVectorMap is a HashMap facade. It comes with a static factory to create the vector based on the tokens in a specific field from a search.
> I use the TokenFrequencyVectorMap to build one vector per category and instantiate a MultiTokenFrequencyVector for each user query. I could probably save a couple of clock ticks by buffering MultiVectors rather than creating new ones all the time.
> Also, it seems that the ngram code might not be thread safe. This also includes the source in CVS. I have not succeeded in proving it when testing, only in the live environment. Each instance of Spellchecker only suggests once. And it takes quite some resources to create new instances of the spellchecker as it is designed today. I might get back to that subject.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
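
The Map-based suggestion cache mentioned in the update could look roughly like the sketch below. This is only an illustration under stated assumptions, not the code in the attachment: the class name CachingSuggester and the delegate function are hypothetical, and the sketch uses java.util.function and ConcurrentHashMap from a newer Java than the 2006 code base. A concurrent map is chosen here because of the thread-safety concern raised in the issue description.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical wrapper (not part of the LUCENE-537 attachment) that
 * remembers the suggestions computed for each input word.
 */
class CachingSuggester {
    // word -> suggestions already computed for that word
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

    // The expensive lookup, e.g. a call into an ngram spell checker.
    // Its shape (String -> List<String>) is an assumption for this sketch.
    private final Function<String, List<String>> delegate;

    CachingSuggester(Function<String, List<String>> delegate) {
        this.delegate = delegate;
    }

    /** Returns cached suggestions, computing them on the first request only. */
    List<String> suggest(String word) {
        return cache.computeIfAbsent(word, delegate);
    }

    /** Number of distinct words currently held in the cache. */
    int size() {
        return cache.size();
    }
}
```

The cache in this sketch is unbounded, so RAM use grows with the number of distinct words looked up; a real deployment would likely want an eviction policy, and would need to clear the cache whenever the underlying index changes.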