Return-Path: Delivered-To: apmail-lucene-java-dev-archive@www.apache.org Received: (qmail 71411 invoked from network); 2 Feb 2009 14:00:31 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 2 Feb 2009 14:00:31 -0000 Received: (qmail 81638 invoked by uid 500); 2 Feb 2009 14:00:29 -0000 Delivered-To: apmail-lucene-java-dev-archive@lucene.apache.org Received: (qmail 81594 invoked by uid 500); 2 Feb 2009 14:00:28 -0000 Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-dev@lucene.apache.org Delivered-To: mailing list java-dev@lucene.apache.org Received: (qmail 81585 invoked by uid 99); 2 Feb 2009 14:00:28 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 02 Feb 2009 06:00:28 -0800 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 02 Feb 2009 14:00:19 +0000 Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 8A2C6234C4B7 for ; Mon, 2 Feb 2009 05:59:59 -0800 (PST) Message-ID: <508992145.1233583199564.JavaMail.jira@brutus> Date: Mon, 2 Feb 2009 05:59:59 -0800 (PST) From: "Robert Muir (JIRA)" To: java-dev@lucene.apache.org Subject: [jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied In-Reply-To: <679712465.1233338219715.JavaMail.jira@brutus> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669601#action_12669601 ] Robert Muir commented on LUCENE-1532: ------------------------------------- I think we are on the same page here, I'm just suggesting that if the broad goal is to improve spellcheck, I think smarter distance metrics are also worth looking at. In my tests I got significantly better results by tuning the ED function as mentioned, I also use freetts/cmudict to incorporate phonetic edit distance and average the two. (The idea being to help with true typos but also with genuinely bad spellers). The downside to these tricks are that they are language-dependent. For reference the other thing I will mention is aspell has some test data here: http://aspell.net/test/orig/ , maybe it is useful in some way? > File based spellcheck with doc frequencies supplied > --------------------------------------------------- > > Key: LUCENE-1532 > URL: https://issues.apache.org/jira/browse/LUCENE-1532 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/spellchecker > Reporter: David Bowen > Priority: Minor > > The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered. It would be very useful to have the option of supplying an integer with each word which indicates its commonness. I.e. the integer could be the document frequency in some index or set of indexes. > I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index. So Lucene users can provide alternative implementations of DocFrequencyInfo. I could submit this as a patch if there is interest. Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org For additional commands, e-mail: java-dev-help@lucene.apache.org