Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <1849521204.1233583799576.JavaMail.jira@brutus>
Date: Mon, 2 Feb 2009 06:09:59 -0800 (PST)
From: "Robert Muir (JIRA)" <jira@apache.org>
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1532) File based spellcheck with doc
 frequencies supplied
In-Reply-To: <679712465.1233338219715.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669605#action_12669605 ] 

Robert Muir commented on LUCENE-1532:
-------------------------------------

when you talk about hardcoding normalization, I really don't see where its unfair or even 'hardcoding' to assume a zipfian distribution in any corpus of text for incorporating the frequency weight....

I agree the specific corpus determines some of these properties but at the end of the day they all tend to have the same general distribution curve even if the specifics are different.

> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org