From: "Karl Wettin (JIRA)"
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Date: Sat, 3 Mar 2007 02:56:50 -0800 (PST)
Subject: [jira] Commented: (LUCENE-786) Extended javadocs in spellchecker
Message-ID: <27389595.1172919410900.JavaMail.jira@brutus>
In-Reply-To: <18757248.1169823469051.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/LUCENE-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477607 ]

Karl Wettin commented on LUCENE-786:
------------------------------------

It might be noteworthy that the spell checker will gather numSug * 10 hits from the a priori corpus. I suppose that number (10) was something the original author came up with when testing. In most cases it seems to be good enough.

In my refactor I've introduced a method parameter for the factor. This is probably a better-looking solution than telling the user to increase numSug, since a small numSug saves a few clock ticks by not adding suggestions to the priority list. The javadocs should probably state something like that instead.

> Extended javadocs in spellchecker
> ---------------------------------
>
>                 Key: LUCENE-786
>                 URL: https://issues.apache.org/jira/browse/LUCENE-786
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Javadocs
>    Affects Versions: 2.0.0
>            Reporter: Karl Wettin
>         Assigned To: Otis Gospodnetic
>            Priority: Trivial
>         Attachments: spellcheck_javadocs.diff
>
>
> Added some javadocs that explain why the spellchecker does not work as one might expect it to.
> http://www.nabble.com/SpellChecker%3A%3AsuggestSimilar%28%29-Question-tf3118660.html#a8640395
> > Without having looked at the code for a long time, I think the problem is what the
> > Lucene scoring considers to be best. First the grams are searched, resulting in a number
> > of hits. Then the edit distance is calculated on each hit. "Genetics" is apparently the
> > third most similar hit according to Lucene, but the best according to Levenshtein.
> >
> > I.e. Lucene does not use edit distance as similarity. You need to get a bunch of best hits
> > in order to find the one with the smallest edit distance.
> I took a look at the code, and my assessment seems to be right.
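
For illustration, the two-stage behaviour described above (gather numSug * factor hits by gram similarity, then pick the ones with the smallest edit distance) can be sketched roughly as follows. This is not the SpellChecker code: the class TwoStageSuggest, its method names, the in-memory dictionary and the shared-bigram count standing in for Lucene's scoring are all assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names and scoring are assumptions, not Lucene's API.
class TwoStageSuggest {

    // Distinct character n-grams of a word.
    static Set<String> grams(String word, int n) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // Stand-in for the index score: number of distinct n-grams both words share.
    static int gramOverlap(String a, String b, int n) {
        Set<String> shared = grams(a, n);
        shared.retainAll(grams(b, n));
        return shared.size();
    }

    // Dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    static List<String> suggest(String word, List<String> dictionary, int numSug, int factor) {
        // Stage 1: keep the numSug * factor candidates with the highest bigram overlap.
        List<String> byGrams = new ArrayList<>(dictionary);
        byGrams.sort(Comparator.comparingInt((String c) -> -gramOverlap(word, c, 2)));
        int keep = Math.min(byGrams.size(), numSug * factor);
        List<String> hits = new ArrayList<>(byGrams.subList(0, keep));

        // Stage 2: re-rank only those hits by edit distance and return the best numSug.
        hits.sort(Comparator.comparingInt((String c) -> levenshtein(word, c)));
        return new ArrayList<>(hits.subList(0, Math.min(hits.size(), numSug)));
    }
}
```

With the made-up dictionary "genetics", "antiemetics", "cosmetics", "geometrics" and the misspelling "gemetics", suggest(..., 1, 1) returns "antiemetics", because it shares the most bigrams, while suggest(..., 1, 10) returns "genetics", the entry with the smallest edit distance. That is the reason for exposing the factor instead of asking the caller to raise numSug.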