lucene-dev mailing list archives

From "Mark Miller (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-1532) File based spellcheck with doc frequencies supplied
Date Mon, 02 Feb 2009 13:03:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669589#action_12669589 ]

markrmiller@gmail.com edited comment on LUCENE-1532 at 2/2/09 5:02 AM:
-------------------------------------------------------------

bq. but I'm not sure the exact frequency number at just word-level is really that useful for
spelling correction, assuming a normal zipfian distribution. 

That's what normalizing down takes care of. 1-10 is just out of the hat. You could do 1-3 and
have low freq, med freq, hi freq. (Note: I found that when normalizing, taking the top value
as something like the 90th-95th percentile created a better distribution - it knocks off a
decent amount of outliers that can push everything else to lower freq values.)
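
A minimal sketch of that normalization, assuming simple linear scaling and a nearest-rank
percentile. The class name FreqNormalizer and its parameters are hypothetical, not part of
the spellchecker API. Raw frequencies are clamped to the chosen percentile and then mapped
onto buckets 1..numBuckets:

```java
import java.util.Arrays;

/** Hypothetical helper: maps raw word frequencies onto buckets 1..numBuckets. */
final class FreqNormalizer {
    private final long cap;       // frequency treated as the "top" value
    private final int numBuckets; // e.g. 10 for 1-10, or 3 for low/med/hi

    /**
     * @param freqs      raw frequencies of all dictionary words (must be non-empty)
     * @param percentile integer percentile in 1..100 used as the top value, e.g. 95
     * @param numBuckets number of buckets to normalize down to
     */
    FreqNormalizer(int[] freqs, int percentile, int numBuckets) {
        int[] sorted = freqs.clone();
        Arrays.sort(sorted);
        // Nearest-rank percentile, computed with integer ceiling division.
        int rank = (percentile * sorted.length + 99) / 100;
        int idx = Math.min(sorted.length - 1, Math.max(0, rank - 1));
        this.cap = Math.max(1, sorted[idx]);
        this.numBuckets = numBuckets;
    }

    /** Returns a bucket in 1..numBuckets; anything above the cap lands in the top bucket. */
    int bucket(int freq) {
        long clamped = Math.min(Math.max(freq, 0), cap);
        // Ceiling of clamped * numBuckets / cap, with a floor of 1 for very rare words.
        long b = (clamped * numBuckets + cap - 1) / cap;
        return (int) Math.max(1, b);
    }
}
```

With frequencies 1..19 plus one outlier of 100000, capping at the 95th percentile spreads
the ordinary words across the buckets, while using the raw maximum (percentile 100) pushes
all of them into bucket 1.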

Consider I make a site called MarkMiller.com - it's full of stuff about Mark Miller. In my
dictionary is Mike Muller though, who is mentioned on the site twice. Mark Miller is mentioned
thousands of times. Now if I type something like Mlller and it suggests Muller just using edit
distance - that type of thing will create a lot of bad suggestions. Muller is practically
unheard of on my site, but I am suggesting it over Miller, which is all over the place. Edit
distance by itself as the first cutoff creates too many of these close bad suggestions. So
it's not that freq should be used heavily - but it can clear up these little oddities quite
nicely.
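
To make the Mlller example concrete, here is a rough sketch that ranks candidates by edit
distance first and uses frequency only to break ties. It is an illustration under stated
assumptions, not the contrib spellchecker's actual scoring: the class FreqAwareSuggester,
the plain Levenshtein distance, and the in-memory word-to-frequency map are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: edit distance is the primary key, frequency the tie-breaker. */
final class FreqAwareSuggester {

    /** Plain Levenshtein distance (insert, delete, substitute each cost 1). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns dictionary words within maxDistance, closest first, most frequent first on ties. */
    static List<String> suggest(String input, Map<String, Integer> dict, int maxDistance) {
        List<String> out = new ArrayList<>();
        for (String word : dict.keySet()) {
            if (editDistance(input, word) <= maxDistance) {
                out.add(word);
            }
        }
        out.sort((x, y) -> {
            int byDistance = Integer.compare(editDistance(input, x), editDistance(input, y));
            if (byDistance != 0) {
                return byDistance;
            }
            // Equal distance: prefer the word that is more common in the corpus.
            int byFreq = Integer.compare(dict.get(y), dict.get(x));
            return byFreq != 0 ? byFreq : x.compareTo(y);
        });
        return out;
    }
}
```

Given a dictionary with Miller at 5000 and Muller at 2, both are one edit away from Mlller,
so edit distance alone cannot separate them; the frequency tie-break puts Miller first.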



  
> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it
> can suggest a very obscure word rather than a more common word which is equally close to the
> misspelled word that was entered.  It would be very useful to have the option of supplying
> an integer with each word which indicates its commonness.  I.e. the integer could be the document
> frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo
> interface for obtaining the doc frequency of a word, and a class which implements the interface
> by looking up the frequency in an index.  So Lucene users can provide alternative implementations
> of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively,
> it might be better to just extend the spellcheck API to have a way to supply the frequencies
> when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere
> when building the spellcheck index, and I'm not sure how best to do that.
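
For illustration only, one possible shape for the DocFrequencyInfo idea described above.
The proposed patch is not shown here, so the method signature docFreq(String), the class
FileDocFrequencyInfo, and the "word frequency" line format are assumptions. This variant
reads frequencies supplied alongside each word instead of looking them up in an index:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Assumed shape of the interface; the real signature in the patch may differ. */
interface DocFrequencyInfo {
    int docFreq(String word);
}

/** Hypothetical implementation backed by "word frequency" lines from a dictionary file. */
final class FileDocFrequencyInfo implements DocFrequencyInfo {
    private final Map<String, Integer> freqs = new HashMap<>();

    /** Each line holds a word, optionally followed by whitespace and an integer frequency. */
    FileDocFrequencyInfo(List<String> lines) {
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts[0].isEmpty()) {
                continue; // skip blank lines
            }
            // Words listed without a number default to a frequency of 1 (an assumption).
            int freq = parts.length > 1 ? Integer.parseInt(parts[1]) : 1;
            freqs.put(parts[0], freq);
        }
    }

    @Override
    public int docFreq(String word) {
        Integer freq = freqs.get(word);
        return freq == null ? 0 : freq;
    }
}
```

An index-backed implementation, as the description mentions, would implement the same
interface and answer docFreq from the index instead of a map.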

