lucene-dev mailing list archives

From "Mark Miller (JIRA)" <>
Subject [jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied
Date Mon, 02 Feb 2009 03:27:59 GMT


Mark Miller commented on LUCENE-1532:

A little experimentation showed better results. It may depend, though. I think it's more useful
when the dictionary contains lots of misspellings (as many index-based spellcheck indexes do).
In that case, I think it's more important that docFreq play a role alongside edit distance to
get good results (rather than just being an edit-distance tie breaker). The fact that one term
appeared 30,000 times and another 36,700 doesn't make much of a difference in spell checking.
Words that are relatively similar in frequency are bucketed together, and then edit distance
can judge from there. Especially with misspellings, this can work really well. The unaltered
term frequencies are too widely distributed to be very helpful as part of a weight. Normalizing
down allows the edit distance to play a stronger role, and keeps super-frequent terms from
clobbering good results. But it still makes the more frequent terms more likely to be chosen as
the suggestion. The edit distances will likely be similar too - but say one word beats another
by a small edit-distance margin - it can certainly make sense to choose the word that lost,
because it has a frequency of 10 and the other word a frequency of 1. You will satisfy more
users. Even at 10 vs. 4, or 10 vs. 5, you will likely guess better.
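To make the bucketing idea concrete, here's a minimal sketch (purely illustrative, not from any patch; `freqBucket` is a made-up name) showing how a coarse log-scale bucket collapses the 30,000 vs. 36,700 difference while still separating common words from rare ones:

```java
// Illustrative sketch: collapse raw doc frequencies into coarse buckets so
// that terms of similar frequency compare by edit distance alone, and
// frequency only matters across buckets.
public class FreqBucketExample {

    // Bucket a raw document frequency on a log10 scale.
    static int freqBucket(int docFreq) {
        return (int) Math.floor(Math.log10(Math.max(docFreq, 1)));
    }

    public static void main(String[] args) {
        // 30,000 and 36,700 land in the same bucket, so edit distance
        // decides between those two terms.
        System.out.println(freqBucket(30000)); // 4
        System.out.println(freqBucket(36700)); // 4
        // A frequency of 10 sits in a higher bucket than a frequency of 1,
        // so the more common word can be preferred.
        System.out.println(freqBucket(10));    // 1
        System.out.println(freqBucket(1));     // 0
    }
}
```

The log scale here is just one way to normalize down; any coarse quantization that groups "relatively similar" frequencies together would serve the same purpose.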

Keep in mind, I'm no expert on spell checking though.

I have a feeling that a similar move would be beneficial to a dictionary-based spellchecker
too. Breaking the freqs down into smaller buckets keeps insignificant differences from playing
a role in the correction. I'd love to test a little and see how straight edit distance compares
to an edit distance / freq weight with a dictionary approach. I still wouldn't be surprised if
slightly favoring more frequent words, by allowing a bit of edit-distance leeway, improved
results. Saying this word is chosen because it beats the other by a slim edit-distance margin,
when the loser is a high-frequency word in the language and the winner a low-frequency one,
makes little sense.
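One way to read the edit-distance-leeway idea is a comparator like the following sketch - entirely illustrative, not from any patch - where a candidate that loses by a slim edit-distance margin can still win if it is much more frequent:

```java
// Illustrative sketch: pick between two suggestion candidates, allowing a
// small edit-distance leeway within which frequency decides the winner.
public class LeewayExample {

    // If the edit distances are within `leeway` of each other, prefer the
    // more frequent word; otherwise prefer the closer word.
    static String pick(String a, int distA, int freqA,
                       String b, int distB, int freqB, int leeway) {
        if (Math.abs(distA - distB) <= leeway) {
            return freqA >= freqB ? a : b;
        }
        return distA < distB ? a : b;
    }

    public static void main(String[] args) {
        // "common" loses on edit distance by 1, but is far more frequent,
        // so with a leeway of 1 it wins anyway.
        System.out.println(pick("rare", 2, 1, "common", 3, 10, 1)); // common
        // With no leeway, the closer word wins regardless of frequency.
        System.out.println(pick("rare", 2, 1, "common", 3, 10, 0)); // rare
    }
}
```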

I just kind of like the idea of unifying the two approaches, too. Really just thinking out
loud though.

- Mark

> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>                 Key: LUCENE-1532
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
> The file-based spellchecker treats all words in the dictionary as equally valid, so it
> can suggest a very obscure word rather than a more common word which is equally close to the
> misspelled word that was entered.  It would be very useful to have the option of supplying
> an integer with each word which indicates its commonness.  I.e. the integer could be the document
> frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo
> interface for obtaining the doc frequency of a word, and a class which implements the interface
> by looking up the frequency in an index.  So Lucene users can provide alternative implementations
> of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively,
> it might be better to just extend the spellcheck API to have a way to supply the frequencies
> when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere
> when building the spellcheck index, and I'm not sure how best to do that.
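The issue's patch is not attached here, so the following is only a guess at the shape of the DocFrequencyInfo interface it describes: a single lookup method, plus an alternative map-backed implementation standing in for the index-backed one the issue mentions. The method name `docFrequency` and the demo class are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Guessed shape of the interface the issue describes: one frequency lookup.
interface DocFrequencyInfo {
    // Return the document frequency of the given word, or 0 if unknown.
    int docFrequency(String word);
}

// An alternative implementation backed by a plain in-memory map, of the kind
// users could supply instead of an index-based one.
class MapDocFrequencyInfo implements DocFrequencyInfo {
    private final Map<String, Integer> freqs;

    MapDocFrequencyInfo(Map<String, Integer> freqs) {
        this.freqs = freqs;
    }

    @Override
    public int docFrequency(String word) {
        return freqs.getOrDefault(word, 0);
    }
}

public class DocFrequencyInfoDemo {
    public static void main(String[] args) {
        Map<String, Integer> freqs = new HashMap<>();
        freqs.put("search", 30000);
        freqs.put("serach", 3);
        DocFrequencyInfo info = new MapDocFrequencyInfo(freqs);
        // A spellchecker could consult docFrequency(...) to prefer the
        // common word over the obscure one when edit distances are close.
        System.out.println(info.docFrequency("search")); // 30000
        System.out.println(info.docFrequency("lucene")); // 0
    }
}
```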

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

