lucene-solr-user mailing list archives

From "Dyer, James" <>
Subject RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.
Date Wed, 24 Apr 2013 18:07:00 GMT
When getting collations, there are two steps.

First, the spellchecker gets individual word choices for each misspelled word.  By default,
these are sorted by string distance first, then by document frequency.  You can override
this by specifying <str name="comparatorClass">freq</str> in your spellchecker
component configuration in solrconfig.xml.  The example provided in the distribution has a
commented-out section explaining this.
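As a rough sketch, the override might look like this in solrconfig.xml (the component name and field here are placeholders; adapt them to your own schema):

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">text</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <!-- sort individual suggestions by document frequency
         instead of the default string distance -->
    <str name="comparatorClass">freq</str>
  </lst>
</searchComponent>
```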

In the second step, one correction is taken off each list and checked against the index to
see if it is a valid collation.  To be valid, it needs to return at least one hit.  The order
in which word combinations are tried is dictated by the first step.  Once it runs out of tries,
runs out of suggestions, or has enough valid collations, it stops.  You cannot configure it
to try a large batch and sort by the number of hits or anything like that.  You would have to
request a large number of collations and do the sorting in your application, but that runs
the risk of high QTimes.

So you can sort by frequency, but not by hits.  Sorting by hits would mean trying a lot of
collations and that is probably too expensive.
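To illustrate the trade-off, the second step is usually driven by request parameters along these lines (the values shown are only illustrative; raising maxCollations and maxCollationTries is what drives QTime up):

```
spellcheck=true
&spellcheck.collate=true
&spellcheck.maxCollations=10
&spellcheck.maxCollationTries=100
&spellcheck.collateExtendedResults=true
```

With collateExtendedResults=true, each collation is returned with its hit count, so an application could re-sort the returned collations by hits itself, at the cost of the extra tries.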

One caveat is that sorting by frequency could result in far-afield results being returned
to the user.  You might find that lower-frequency, smaller-edit-distance suggestions give
the user what they want more often than higher-edit-distance, higher-frequency suggestions.
Just because a word is very common doesn't mean it is the right word.  This is why "distance"
is the default and not "freq".

James Dyer
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: SandeepM [] 
Sent: Wednesday, April 24, 2013 12:13 PM
Subject: RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

One of our main concerns is that Solr returns the best match based on what it
thinks is best.  It uses the Levenshtein distance metric to determine the
best suggestions.  Can we tune this to put more weight on
frequency/hits versus the number of edits?  If we can tune this, suggestions
would seem more relevant when corrected.  Also, if we can do this while
keeping maxCollations = 1 and maxCollationTries = "some reasonable number so
that QTime does not go out of control", that would be great!

Any insights into this would be great. Thanks for your help.

-- Sandeep
