lucene-solr-user mailing list archives

From "Dyer, James" <James.D...@ingrambook.com>
Subject RE: Spelling Corrector Algorithm
Date Thu, 01 Mar 2012 15:59:51 GMT
Yavar,

When you listed what the spell checker returns you put them in this order:

> Marine (Freq: 120), Market (Freq: 900) and others

Was "Marine" listed first, and then did you pick "Market" because you thought higher frequency
is better?  If so, you probably have the right settings already but need to trust it and go
with the first result.

If, on the other hand, the wrong suggestions truly are coming up first, you have 2 extension
points:

1. You can change the comparator class.  The default one sorts by "score" (distance) first
and then "frequency" to break ties.  There is also a prepackaged comparator that sorts just
on frequency, or you can write your own (implementing Comparator<org.apache.lucene.search.spell.SuggestWord>).
But I doubt you'd want to change this one.  (See http://wiki.apache.org/solr/SpellCheckComponent#Custom_Comparators_and_the_Lucene_Spell_Checkers_.28IndexBasedSpellChecker.2C_FileBasedSpellChecker.2C_DirectSolrSpellChecker.29
for more info.)

2. You can change the distance metric.  The default uses Levenshtein distance, but there is
also an implementation for Jaro-Winkler distance.  (see the wikipedia articles for these 2
if you want to know the subtle differences).  It almost seems to me that Jaro-Winkler might
give you better results but you'd have to test.  See the example under http://wiki.apache.org/solr/SpellCheckComponent?highlight=%28distanceMeasure%29#Configuration
for more information on how to configure this.
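
To make the two orderings from point 1 concrete, here is a minimal sketch in plain Python.
It is not the Lucene comparator code: the Suggestion type, both function names and the
score values are made up for illustration, and only the frequencies 120 and 900 come from
this thread.

```python
from typing import List, NamedTuple


class Suggestion(NamedTuple):
    """Hypothetical stand-in for a spell check suggestion (not the Lucene class)."""
    word: str
    score: float  # similarity to the query term; higher means closer
    freq: int     # how often the word occurs in the index


def score_then_frequency(suggestions: List[Suggestion]) -> List[Suggestion]:
    # Default-style ordering: best score first, frequency only breaks ties.
    return sorted(suggestions, key=lambda s: (-s.score, -s.freq))


def frequency_only(suggestions: List[Suggestion]) -> List[Suggestion]:
    # Frequency-style ordering: the most common word wins regardless of score.
    return sorted(suggestions, key=lambda s: -s.freq)


# Illustrative values only; the scores are assumptions, not real Solr output.
candidates = [
    Suggestion("market", 0.84, 900),
    Suggestion("marine", 0.97, 120),
]

by_score = [s.word for s in score_then_frequency(candidates)]
by_freq = [s.word for s in frequency_only(candidates)]
```

With these assumed scores, the score-first ordering puts "marine" ahead of "market", while
the frequency-only ordering reverses them. That is the effect of picking the highest-frequency
suggestion by hand.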

If neither distance measure works for you, you could try implementing your own by creating
a class implementing "org.apache.lucene.search.spell.StringDistance", then specify your class
for the "distanceMeasure" parameter.
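
For a rough feel of why Jaro-Winkler might separate the two candidates where edit distance
does not, here is a self-contained Python sketch of the textbook versions of both measures.
It is an illustration only, not the Lucene implementations (which may differ in details
such as thresholds and normalization), and the function names are my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def jaro(a: str, b: str) -> float:
    """Textbook Jaro similarity in the range 0..1 (1 means identical)."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)


query = "marien"
edit_marine = levenshtein(query, "marine")
edit_market = levenshtein(query, "market")
jw_marine = jaro_winkler(query, "marine")
jw_market = jaro_winkler(query, "market")
```

In this sketch both candidates are 2 edits away from "marien", but "marine" gets the higher
Jaro-Winkler similarity because all six letters match (with one swap) and the two words share
a longer prefix. Real results depend on the actual implementation, so you would still need to
test.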

Finally, there are some other possibly easy solutions to your problem you should test before
going through the trouble of writing custom code:

1. Try a higher "spellcheck.count".  Even if you only want a couple of results the algorithm
works better with this set >5 (10-20 might be optimal in some cases).
2. Use DirectSolrSpellChecker, if on 4.x.  This one is not influenced by "spellcheck.count",
so you can truly set it to 1 if all you want is 1 result.
3. Use "spellcheck.collate=true" and set "spellcheck.maxCollationTries" to maybe 5 or 10.
 This will try the various suggestions by querying the index along with any other query parameters
(other keywords, filters, etc), letting you know which suggestions are going to truly return
hits in context (and how many).
4. Try Jaro-Winkler (as mentioned above).
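
As a sketch of options 1 and 3 combined, the request parameters could be assembled like this
in Python. The host, port and "/solr/select" path are assumptions about a local test setup,
the helper name is made up, and the values 10 and 5 are just the ballpark figures mentioned
above.

```python
from urllib.parse import urlencode


def build_spellcheck_query(term: str, count: int = 10, collation_tries: int = 5) -> str:
    """Build a query string that asks for suggestions plus tested collations."""
    params = {
        "q": term,
        "spellcheck": "true",
        "spellcheck.extendedResults": "true",  # include frequencies in the response
        "spellcheck.count": count,             # ask for more candidates than you need
        "spellcheck.collate": "true",
        "spellcheck.maxCollationTries": collation_tries,
    }
    return urlencode(params)


query_string = build_spellcheck_query("marien")
# Assumed base URL for a local Solr instance; adjust to your own handler path.
url = "http://localhost:8983/solr/select?" + query_string
```

Any other query parameters you normally send (filters and so on) would go into the same
request, since the collation tries are run in that context.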

Hope this helps.  But in the end, especially with 1-word queries, I doubt even the best algorithms
are going to always accurately guess what the user wanted.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Husain, Yavar [mailto:yhusain@firstam.com] 
Sent: Thursday, March 01, 2012 5:43 AM
To: solr-user@lucene.apache.org
Subject: Spelling Corrector Algorithm

Hi

For spell checking component I set extendedResults to get the frequencies and then select
the word with the best frequency. I understand the spell check algorithm based on Edit Distance.
For an example:

Query to Solr: Marien

Spell Check Text Returned: Marine (Freq: 120), Market (Freq: 900) and others. My dictionary
here is based on indexed words.

So I chose Market (higher frequency), which is wrong, as my intent was Marine. Both have an
Edit Distance of 2.

Now how can I improve this Algorithm to select marine instead of market (based on something
more than edit distance and frequency stuff)?

Do I have to incorporate some "soundex" algorithms too?

I am looking for simple stuff which I can quickly implement.

I even tried using Peter Norvig's spell corrector Algorithm (which is great) but again I ran
into the same problems.