lucene-solr-user mailing list archives

From <jimi.hulleg...@svensktnaringsliv.se>
Subject RE: Can't get spelling suggestions to work properly
Date Tue, 10 Jan 2017 15:41:10 GMT
No one has any input on my post below about the spelling suggestions? I just find it a bit
frustrating not being able to understand this feature better, and why it doesn't give the
expected results. A built-in "explain" feature would really have helped.

/Jimi

-----Original Message-----
From: jimi.hullegard@svensktnaringsliv.se [mailto:jimi.hullegard@svensktnaringsliv.se] 
Sent: Friday, December 16, 2016 9:58 PM
To: solr-user@lucene.apache.org
Subject: Can't get spelling suggestions to work properly

Hi,

I'm trying to add the spelling suggestion feature to our search, but I'm having problems getting
suggestions on some misspellings.

For example, the Swedish word 'mycket' exists in ~14,000 of a total of ~40,000 documents in
our index.

A search for the incorrect spelling 'myket' (a missing 'c') gives several spelling suggestions,
and the top one is 'mycket'. This is the expected behavior.

But a search for the incorrect spelling 'mycet' (a missing 'k') gives no spelling suggestions.

The only difference between these two searches is that the one that produced spelling suggestions
had zero results, while the other had two (2) results. These two documents contain the
incorrect spelling ('mycet'). Could this be why there are no spelling suggestions? But I have
set 'maxQueryFrequency' to 0.001, and with 40,000 documents in the index that should mean
that the word can exist in up to 40 documents, and since 2 is less than 40 I would argue that
this word should be considered a spelling mistake. But for some reason the Solr spellchecker
considers 'myket' an incorrect spelling, while 'mycet' is incorrectly considered a correct
spelling.
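
The arithmetic above can be checked with a small sketch. It assumes that a 'maxQueryFrequency'
below 1 is read as a fraction of all documents and a value of 1 or more as an absolute document
count; the exact rounding Solr applies is not reproduced here. The helper name
max_docs_allowed is made up for this illustration. Both 0.001 (the value mentioned above) and
0.01 (the value listed for swedishSpelling in the configuration further down) are tried.

```python
def max_docs_allowed(max_query_frequency, total_docs):
    """Largest document frequency at which a query term still counts as rare.

    Hypothetical helper: an assumption about how the setting is interpreted,
    not a copy of Solr's code.
    """
    if max_query_frequency >= 1:
        # Values of 1 or more are treated as an absolute document count.
        return max_query_frequency
    # Values below 1 are treated as a fraction of all documents.
    return max_query_frequency * total_docs


TOTAL_DOCS = 40_000  # approximate index size from the post
MYCET_DOCS = 2       # documents that contain the misspelling 'mycet'

for fraction in (0.001, 0.01):
    limit = max_docs_allowed(fraction, TOTAL_DOCS)
    under_limit = MYCET_DOCS <= limit
    print(f"maxQueryFrequency={fraction}: limit is about {limit:.0f} docs, "
          f"'mycet' under the limit: {under_limit}")
```

With either value the two documents are well under the limit, so under these assumptions
'maxQueryFrequency' alone does not explain the missing suggestions.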

I also tried spellcheck.accuracy=0, just to rule out that my accuracy setting is too high,
but that didn't help.

Can someone see what I'm doing wrong, or give some tips on configuration changes and/or how
I can troubleshoot this? For example, is there any way to debug the spellchecker function?


Here are the searches:

Search for 'myket':

http://localhost:8080/solr/s2/select/?q=myket&rows=100&sort=score+desc&fl=*%2Cscore%2C%5Bexplain+style%3Dtext%5D&defType=edismax&qf=title%5E2&qf=swedishText1%5E1&spellcheck=true&spellcheck.accuracy=0&spellcheck.maxCollationTries=200&fq=%2Bactivatedate%3A%5B*+TO+NOW%5D+%2Bexpiredate%3A%5BNOW+TO+*%5D+%2B%28state%3Apublished+OR+state%3Adraft-published+OR+state%3Asubmitted-published+OR+state%3Aapproved-published%29&wt=xml&indent=true
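
Since the encoded URL is hard to read, here is a minimal sketch that decodes its query string
into plain parameters using only the Python standard library. The URL is copied from the
request above; nothing is sent to Solr.

```python
from urllib.parse import urlsplit, parse_qs

# The 'myket' request from the post, split over several lines for readability.
URL = (
    "http://localhost:8080/solr/s2/select/?q=myket&rows=100&sort=score+desc"
    "&fl=*%2Cscore%2C%5Bexplain+style%3Dtext%5D&defType=edismax"
    "&qf=title%5E2&qf=swedishText1%5E1"
    "&spellcheck=true&spellcheck.accuracy=0&spellcheck.maxCollationTries=200"
    "&fq=%2Bactivatedate%3A%5B*+TO+NOW%5D+%2Bexpiredate%3A%5BNOW+TO+*%5D"
    "+%2B%28state%3Apublished+OR+state%3Adraft-published"
    "+OR+state%3Asubmitted-published+OR+state%3Aapproved-published%29"
    "&wt=xml&indent=true"
)

# parse_qs maps each parameter name to the list of its decoded values.
params = parse_qs(urlsplit(URL).query)

for name, values in params.items():
    for value in values:
        print(f"{name} = {value}")
```

Decoded this way, it is easier to see that the request itself sets spellcheck.accuracy=0 and
spellcheck.maxCollationTries=200, next to the defaults configured on the request handler.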

Spellcheck output for 'myket':

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="myket">
      <int name="numFound">16</int>
      <int name="startOffset">0</int>
      <int name="endOffset">5</int>
      <int name="origFreq">0</int>
      <arr name="suggestion">
        <lst>
          <str name="word">mycket</str>
          <int name="freq">14039</int>
        </lst>
        [...]
      </arr>
    </lst>
    <bool name="correctlySpelled">false</bool>
    <lst name="collation">
      <str name="collationQuery">mycket</str>
      <int name="hits">14005</int>
      <lst name="misspellingsAndCorrections">
        <str name="myket">mycket</str>
      </lst>
    </lst>
    [...]
    </lst>
  </lst>
</lst>
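
When comparing several queries, it can help to reduce each spellcheck section to the few values
that matter (origFreq, the suggested words and their frequencies, and correctlySpelled). The
sketch below does this with Python's standard xml.etree.ElementTree. The sample is a trimmed,
well-formed copy of the 'myket' output above with the elided parts left out, and the function
name summarize_spellcheck is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the 'myket' spellcheck section (elided parts omitted).
SAMPLE = """
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="myket">
      <int name="numFound">16</int>
      <int name="origFreq">0</int>
      <arr name="suggestion">
        <lst>
          <str name="word">mycket</str>
          <int name="freq">14039</int>
        </lst>
      </arr>
    </lst>
    <bool name="correctlySpelled">false</bool>
  </lst>
</lst>
"""


def summarize_spellcheck(xml_text):
    """Return correctlySpelled plus, per query term, origFreq and suggestions."""
    root = ET.fromstring(xml_text.strip())
    suggestions = root.find("lst[@name='suggestions']")
    summary = {"correctlySpelled": None, "terms": {}}

    flag = suggestions.find("bool[@name='correctlySpelled']")
    if flag is not None:
        summary["correctlySpelled"] = flag.text == "true"

    for term in suggestions.findall("lst"):
        name = term.get("name")
        if name == "collation":
            continue  # collations are not per-term suggestions
        words = [
            (s.findtext("str[@name='word']"), int(s.findtext("int[@name='freq']")))
            for s in term.findall("arr[@name='suggestion']/lst")
        ]
        summary["terms"][name] = {
            "origFreq": int(term.findtext("int[@name='origFreq']")),
            "suggestions": words,
        }
    return summary


print(summarize_spellcheck(SAMPLE))
```

For a response that contains only the correctlySpelled flag, the same function returns an empty
"terms" dictionary, which makes the difference between two queries easy to spot.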


Search for 'mycet':

http://localhost:8080/solr/s2/select/?q=mycet&rows=100&sort=score+desc&fl=*%2Cscore%2C%5Bexplain+style%3Dtext%5D&defType=edismax&qf=title%5E2&qf=swedishText1%5E1&spellcheck=true&spellcheck.accuracy=0&spellcheck.maxCollationTries=200&fq=%2Bactivatedate%3A%5B*+TO+NOW%5D+%2Bexpiredate%3A%5BNOW+TO+*%5D+%2B%28state%3Apublished+OR+state%3Adraft-published+OR+state%3Asubmitted-published+OR+state%3Aapproved-published%29&wt=xml&indent=true

Spellcheck output:

<lst name="spellcheck">
  <lst name="suggestions">
    <bool name="correctlySpelled">true</bool>
  </lst>
</lst>


Below is the relevant configuration.


The field type used for the spellchecker:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([.])" replacement=" " />
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.KeywordRepeatFilterFactory" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldType>


Parameters added to the standard request handler:

<str name="spellcheck.count">20</str>
<str name="spellcheck.dictionary">swedishSpelling</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.extendedResults">true</str>
<str name="spellcheck.collateExtendedResults">true</str>
<str name="spellcheck.maxCollations">2</str>
<str name="spellcheck.maxCollationTries">10</str>

And the spellcheck component:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text</str>
  <lst name="spellchecker">
    <str name="name">swedishSpelling</str>
    <str name="field">swedishSpelling</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.0</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">0</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">4</int>
    <float name="maxQueryFrequency">0.01</float>
    <float name="thresholdTokenFrequency">0.001</float>
  </lst>
  <lst name="spellchecker">
    <str name="name">englishSpelling</str>
    <str name="field">englishSpelling</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.0</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">0</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">4</int>
    <float name="maxQueryFrequency">0.001</float>
    <float name="thresholdTokenFrequency">0.0025</float>
  </lst>
</searchComponent>


Regards
/Jimi
